Re: [I] [Epic]: Google Summer of Code 2025 Improving Spilling Execution [datafusion]

via GitHub Mon, 19 May 2025 10:24:25 -0700


alamb commented on issue #16065:
URL: https://github.com/apache/datafusion/issues/16065#issuecomment-2891762929


   
   > Stabilize external sort and aggregate.
   
   In my opinion, I suggest starting and finishing with external sort -- having 
a robust and performance external sort can be a key building block for other 
algorithms (like sort-merge-join and potentially aggregation)
   
   So in my mind "robust and performant" means:
   1. It is possible to sort any sized data, subject to temp space requirements 
   2. The time to sort data once spilling continues to grow as `O(N*ln(N))` as 
the size of the data increases
   
   I would suggest holding off on Aggregation until @Rachelint  has completed 
designing the blocked approach to memory management (e.g. 
https://github.com/apache/datafusion/pull/15591) as I think that blocked 
approach will be directly related to spilling (aka spilling individual blocks, 
etc)
   
   > Implement a memory-limited nested loop join. Some non-equality joins may 
only be supported by NLJ.
   
   Instead of a new join algorithm, I suggest you look into the existing 
MergeJoin that @comphead and others have invested much effort in optimizing.
   
   The basic idea is that for any buffering join (like NLJ or HJ) if memory is 
exhausted, switch at runtime to Sort-Merge-Join (that is sort the in memory 
buffer, sort the remaining other input, and then use sort merge join to merge)
   
   While this will not be as fast as reusing the hash buffers, I think it will 
be far easier to implement (especially after you have a robust implementation 
of spilling sort)
   
   > Optimize the spill format, likely building on top of Arrow's IPC stream 
reader/writer.
   (And also improve UX/performance along the way)
   > Are there any other tasks worth exploring? I'm not very familiar with 
Arrow IPC internal, are there any stream reader/writer–related tasks we could 
also consider? [@alamb](https://github.com/alamb)
   
   Some other potential things to consider if we don't already is:
   1. permit using compression
   2. investigate writing/reading runs with `mmap` to avoid copying the data 
again
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org

Re: [I] [Epic]: Google Summer of Code 2025 Improving Spilling Execution [datafusion]

Reply via email to