Re: [DISCUSS] CIP - Support Remote Block Spilling for Compute Engines

rexxiong Thu, 19 Mar 2026 20:53:28 -0700

Hi Liam,

Thanks for sharing your idea about integrating Remote Block Spilling into
Celeborn.


I'm excited about the potential for Celeborn to manage not just shuffle
data, but also spills and caches.

In our internal production environment, we've experienced significant
benefits and stability with Celeborn supporting Spark shuffle on
Kubernetes.
However, large jobs still run into local disk space limits because of
spill data.
If Celeborn could manage spill data as well, it would be a great advantage
for us.

We've thought about supporting Spark spill data ourselves and understand
the challenges involved.
A major problem is that Spark has no abstraction for spills, so supporting
this might require substantial modifications to the engine code.

On the Celeborn client side, we'd prefer to expose spill support through
InputStream/OutputStream interfaces and to guarantee that data is read back
in the order it was written.
Beyond that, how spill data is managed is crucial. Celeborn's current core
concept is the shuffle, so introducing a new concept or mode to support
spills might be worth considering.
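
To make the stream idea concrete, here is a minimal sketch of an
order-preserving spill writer and reader. It is only an illustration under
stated assumptions: InMemorySpill, SpillOutputStream and SpillInputStream
are hypothetical names, not existing Celeborn classes, and the in-memory
block list merely stands in for a remote worker that would be reached
through something like the proposed PushSpillData / ReleaseSpill messages.

```java
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class SpillStreamSketch {

    /** Hypothetical stand-in for one remote spill: blocks kept in push order. */
    static final class InMemorySpill {
        private final List<byte[]> blocks = new ArrayList<>();
        void push(byte[] block) { blocks.add(block); } // would be a PushSpillData call
        void release() { blocks.clear(); }             // would be a ReleaseSpill call
        int blockCount() { return blocks.size(); }
        byte[] block(int i) { return blocks.get(i); }
    }

    /** Buffers writes and pushes fixed-size blocks in the order they were written. */
    static final class SpillOutputStream extends OutputStream {
        private final InMemorySpill spill;
        private final byte[] buf;
        private int pos = 0;

        SpillOutputStream(InMemorySpill spill, int blockSize) {
            if (blockSize <= 0) throw new IllegalArgumentException("blockSize must be > 0");
            this.spill = spill;
            this.buf = new byte[blockSize];
        }

        @Override public void write(int b) {
            buf[pos++] = (byte) b;
            if (pos == buf.length) flush();
        }

        @Override public void write(byte[] src, int off, int len) {
            while (len > 0) {
                int n = Math.min(len, buf.length - pos);
                System.arraycopy(src, off, buf, pos, n);
                pos += n; off += n; len -= n;
                if (pos == buf.length) flush();
            }
        }

        @Override public void flush() {
            if (pos > 0) {                      // push only non-empty blocks
                spill.push(Arrays.copyOf(buf, pos));
                pos = 0;
            }
        }

        @Override public void close() { flush(); }
    }

    /** Reads the blocks back strictly in push order. */
    static final class SpillInputStream extends InputStream {
        private final InMemorySpill spill;
        private int blockIdx = 0;
        private int offset = 0;

        SpillInputStream(InMemorySpill spill) { this.spill = spill; }

        @Override public int read() {
            while (blockIdx < spill.blockCount()) {
                byte[] b = spill.block(blockIdx);
                if (offset < b.length) return b[offset++] & 0xFF;
                blockIdx++;                     // current block exhausted, move on
                offset = 0;
            }
            return -1;                          // end of spill
        }
    }

    /** Writes data through the spill streams, reads it back, then releases the spill. */
    static byte[] roundTrip(byte[] data, int blockSize) {
        InMemorySpill spill = new InMemorySpill();
        SpillOutputStream out = new SpillOutputStream(spill, blockSize);
        out.write(data, 0, data.length);
        out.close();

        SpillInputStream in = new SpillInputStream(spill);
        ByteArrayOutputStream result = new ByteArrayOutputStream();
        for (int b = in.read(); b != -1; b = in.read()) result.write(b);
        spill.release();
        return result.toByteArray();
    }

    public static void main(String[] args) {
        byte[] data = "sorted-run-0001".getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(data, roundTrip(data, 4)));
    }
}
```

In a real client the block list would be replaced by network calls, and the
engine-side spill writer would only need to see the two stream types.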

We're eagerly anticipating the implementation of this feature in Celeborn
and would appreciate any detailed design documents you might share.

Best regards,
Jiashu Xiong


Liam Hecht <[email protected]> wrote on Sun, Mar 15, 2026 at 17:02:

> Hi Celeborn Devs,
>
> I would like to start a discussion about a new idea for Celeborn: *Support
> Remote Block Spilling for Compute Engines*.
>
> Celeborn works very well for shuffle data, but compute engines like Spark
> still use local disks for execution spills (for example during large sorts
> or aggregations). This can be a problem on machines with limited local
> storage.
>
> The idea is to extend Celeborn so it can store these spill blocks remotely.
> When an executor runs out of memory, instead of writing to local disk, it
> would send the spilled data to Celeborn Workers.
>
> *Main points of the proposal:*
>
>    - New RPC messages: PushSpillData and ReleaseSpill
>    - Worker-side SpillFileManager to manage spilled blocks
>    - Single-copy storage since the data is temporary
>    - Reuse Celeborn's existing network and storage components
>
> I think this could make Celeborn more useful for cloud native and diskless
> environments.
>
> Happy to share a more detailed design document if there is interest.
>
> Best regards,
> Liam Hecht
>
