alamb opened a new issue, #17089:
URL: https://github.com/apache/datafusion/issues/17089

   ### Is your feature request related to a problem or challenge?
   
   - part of https://github.com/apache/datafusion/issues/14836
   
   While trying to help people understand how to tune memory-related 
settings in https://github.com/apache/datafusion/pull/17069, @2010YOUY01 
pointed out the following in 
https://github.com/apache/datafusion/pull/17069#issuecomment-3167062577:
   
   > Let's make it concise now, I think adding a few more sentences of 
explanation might actually confuse those without the background knowledge. A 
tutorial-style doc is still needed to describe the full picture. 
   
   I agree, and I think a blog post explaining how DataFusion processes 
larger-than-memory datasets would be really helpful to give people the broader context.
   
   ### Describe the solution you'd like
   
   I suggest a blog post on https://datafusion.apache.org/blog/ (submitted as a PR to 
https://github.com/alamb/datafusion-site).
   
   
   
   ### Describe alternatives you've considered
   
   Some ideas for the outline:
   
   1. **Background** on memory usage in DataFusion (the memory manager model 
and what takes large amounts of memory) -- can probably copy/paste from 
https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-overview
   2. **Memory Consumers** Describe at a high level what consumes most memory 
in a plan (group by hash table, sorting, etc)
   3. **Optimizations** that DataFusion does to try to avoid needing memory 
(e.g. take advantage of pre-existing sort orders, topk, etc)
   4. **Spilling Sort**: Provide an overview of the main spilling sort 
algorithm (sort in memory, spill to disk, merge pre-sorted runs, etc)
   5. **Using Spilling Sort in Group By**: Explain that the grouping operation 
uses the same underlying building block
   6. **Call for help**: 🎣 for people to figure out how to add spilling for 
joins and make the spilling hash faster
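
To make item 4 concrete, here is a minimal sketch of the "sort in memory, spill to disk, merge pre-sorted runs" idea. It is an illustration only, not DataFusion's implementation: the names `external_sort`, `spill_run`, `read_run`, and `max_rows_in_memory` are made up for this example, rows are plain integers rather than Arrow record batches, and a row count stands in for the byte-based accounting a real memory pool would do.

```python
import heapq
import os
import tempfile


def spill_run(sorted_rows, spill_dir):
    """Write one already-sorted run to a temporary file, one integer per line."""
    fd, path = tempfile.mkstemp(dir=spill_dir, suffix=".run")
    with os.fdopen(fd, "w") as f:
        for row in sorted_rows:
            f.write(f"{row}\n")
    return path


def read_run(path):
    """Stream a spilled run back lazily, one row at a time."""
    with open(path) as f:
        for line in f:
            yield int(line)


def external_sort(rows, max_rows_in_memory, spill_dir):
    """Yield rows in ascending order, buffering at most max_rows_in_memory rows."""
    run_paths = []
    buffer = []
    for row in rows:
        buffer.append(row)
        if len(buffer) >= max_rows_in_memory:
            # Hypothetical budget reached: sort what we hold and spill it.
            buffer.sort()
            run_paths.append(spill_run(buffer, spill_dir))
            buffer = []
    # The final partial run can stay in memory.
    buffer.sort()
    runs = [read_run(p) for p in run_paths]
    runs.append(iter(buffer))
    # k-way merge of the pre-sorted runs.
    yield from heapq.merge(*runs)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as spill_dir:
        data = [5, 3, 9, 1, 7, 2, 8]
        print(list(external_sort(data, max_rows_in_memory=3, spill_dir=spill_dir)))
        # prints [1, 2, 3, 5, 7, 8, 9]
```

The merge step only needs the current head of each run in memory, which is why the approach scales to inputs much larger than the buffer.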
   
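
For the topk optimization mentioned in item 3, a small sketch could show why a query shaped like `ORDER BY x LIMIT k` does not need a full sort. Again this is only an illustration under simplifying assumptions (integer rows, a hypothetical `top_k_smallest` helper), not how DataFusion's operator is written: the point is that memory is bounded by `k` rather than by the input size.

```python
import heapq


def top_k_smallest(rows, k):
    """Return the k smallest values in ascending order, holding at most k rows."""
    if k <= 0:
        return []
    # heapq is a min-heap, so values are stored negated to get a max-heap
    # whose root is the largest of the k smallest rows seen so far.
    heap = []
    for row in rows:
        if len(heap) < k:
            heapq.heappush(heap, -row)
        elif row < -heap[0]:
            # The new row beats the current worst candidate: swap it in.
            heapq.heapreplace(heap, -row)
    return sorted(-x for x in heap)


if __name__ == "__main__":
    print(top_k_smallest([5, 3, 9, 1, 7, 2, 8], 3))  # prints [1, 2, 3]
```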
   
   ### Additional context
   
   I am very happy to help write this


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

