Hi Celeborn Devs,

I would like to start a discussion about a new idea for Celeborn: *Support
Remote Block Spilling for Compute Engines*.

Celeborn works very well for shuffle data, but compute engines like Spark
still use local disks for execution spills (for example during large sorts
or aggregations). This can be a problem on machines with limited local
storage.

The idea is to extend Celeborn so it can store these spill blocks remotely.
When an executor runs out of memory, instead of writing to local disk, it
would send the spilled data to Celeborn Workers.

*Main points of the proposal:*

   - New RPC messages: PushSpillData and ReleaseSpill
   - Worker-side SpillFileManager to manage spilled blocks
   - Single-copy storage since the data is temporary
   - Reuse Celeborn’s existing network and storage components
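
To make the points above more concrete, here is a rough sketch of how a worker-side SpillFileManager might handle the two proposed messages. This is not existing Celeborn code. The method names, the (appId, spillId) key, and the fetch method for reading a block back are all assumptions for illustration, and an in-memory map stands in for the worker's real storage layer.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch only: none of these names exist in Celeborn today.
final class SpillFileManager {
    // "appId/spillId" -> spilled bytes. Exactly one copy is kept, with no
    // replication, because spill data is temporary. A real worker would
    // write to its existing storage components instead of a map.
    private final Map<String, byte[]> blocks = new ConcurrentHashMap<>();

    private static String key(String appId, long spillId) {
        return appId + "/" + spillId;
    }

    // Handler for the proposed PushSpillData message (signature assumed).
    // Pushing the same spillId again replaces the earlier block.
    void pushSpillData(String appId, long spillId, byte[] data) {
        blocks.put(key(appId, spillId), data.clone());
    }

    // Assumed read path: returns a copy of the block, or null if the
    // block is unknown or has already been released.
    byte[] fetchSpillData(String appId, long spillId) {
        byte[] data = blocks.get(key(appId, spillId));
        return data == null ? null : data.clone();
    }

    // Handler for the proposed ReleaseSpill message (signature assumed).
    // Returns true if a block was actually freed.
    boolean releaseSpill(String appId, long spillId) {
        return blocks.remove(key(appId, spillId)) != null;
    }
}
```

The intended lifecycle is push on spill, fetch when the executor merges its spills, then release so the worker can reclaim the space right away.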

I think this could make Celeborn more useful in cloud-native and diskless
environments.

Happy to share a more detailed design document if there is interest.

Best regards,
Liam Hecht
