alamb opened a new issue, #17089: URL: https://github.com/apache/datafusion/issues/17089
### Is your feature request related to a problem or challenge?

- part of https://github.com/apache/datafusion/issues/14836

As part of trying to help people understand how to tune memory-related settings in https://github.com/apache/datafusion/pull/17069, @2010YOUY01 pointed out in https://github.com/apache/datafusion/pull/17069#issuecomment-3167062577:

> Let's make it concise now, I think adding a few more sentences of explanation might actually confuse those without the background knowledge. A tutorial-style doc is still needed to describe the full picture.

I agree, and I think a blog post explaining how DataFusion processes larger-than-memory data sets would be really helpful to give people the broader context.

### Describe the solution you'd like

I suggest a blog post on https://datafusion.apache.org/blog/ (make a PR to https://github.com/alamb/datafusion-site)

### Describe alternatives you've considered

Some ideas:

1. **Background** on memory usage in DataFusion (the memory manager model and what takes large amounts of memory) -- can probably copy/paste from https://docs.rs/datafusion/latest/datafusion/execution/memory_pool/trait.MemoryPool.html#memory-management-overview
2. **Memory Consumers**: Describe at a high level what consumes the most memory in a plan (group by hash table, sorting, etc.)
3. **Optimizations** that DataFusion does to try to avoid needing memory (e.g. taking advantage of pre-existing sort orders, topk, etc.)
4. **Spilling Sort**: Provide an overview of the main spilling sort algorithm (sort in memory, spill to disk, merge pre-sorted runs, etc.)
5. **Using Spilling Sort in Group By**: Explain that the grouping operation uses the same underlying building block
6. **Call for help**: 🎣 for people to figure out how to add spilling for joins and make the spilling hash faster

### Additional context

I am very happy to help write this
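To make idea 4 concrete, here is a minimal sketch of the general "sort in memory, spill to disk, merge pre-sorted runs" pattern, using only the Rust standard library. It is not DataFusion's implementation: the names `spill_run`, `next_value`, and `external_sort`, the row-count `budget`, and the one-integer-per-line spill format are all assumptions made for illustration.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;
use std::fs::{self, File};
use std::io::{self, BufRead, BufReader, BufWriter, Lines, Write};
use std::path::Path;

/// Sort `buffer` in memory, write it to `path` as one sorted run
/// (one value per line), then clear the buffer so it can be reused.
fn spill_run(buffer: &mut Vec<i64>, path: &Path) -> io::Result<()> {
    buffer.sort_unstable();
    let mut out = BufWriter::new(File::create(path)?);
    for v in buffer.iter() {
        writeln!(out, "{v}")?;
    }
    out.flush()?;
    buffer.clear();
    Ok(())
}

/// Read the next value from a spilled run, or `None` when the run is exhausted.
fn next_value(run: &mut Lines<BufReader<File>>) -> io::Result<Option<i64>> {
    match run.next() {
        Some(line) => {
            let v = line?
                .parse::<i64>()
                .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))?;
            Ok(Some(v))
        }
        None => Ok(None),
    }
}

/// Hypothetical external sort: buffers at most `budget` values at a time,
/// spills each full buffer as a sorted run, then k-way merges the runs.
/// The result is collected into a Vec only to keep the sketch short;
/// a real operator would stream the merged output instead.
fn external_sort(
    input: impl IntoIterator<Item = i64>,
    budget: usize,
    spill_dir: &Path,
) -> io::Result<Vec<i64>> {
    assert!(budget > 0, "budget must allow at least one value");
    fs::create_dir_all(spill_dir)?;

    // Phase 1: sort in memory and spill whenever the budget is reached.
    let mut buffer = Vec::with_capacity(budget);
    let mut run_paths = Vec::new();
    for v in input {
        buffer.push(v);
        if buffer.len() == budget {
            let path = spill_dir.join(format!("run-{}.txt", run_paths.len()));
            spill_run(&mut buffer, &path)?;
            run_paths.push(path);
        }
    }
    if !buffer.is_empty() {
        let path = spill_dir.join(format!("run-{}.txt", run_paths.len()));
        spill_run(&mut buffer, &path)?;
        run_paths.push(path);
    }

    // Phase 2: merge the pre-sorted runs with a min-heap that holds
    // one (value, run index) entry per run that still has data.
    let mut runs = Vec::new();
    for path in &run_paths {
        runs.push(BufReader::new(File::open(path)?).lines());
    }
    let mut heap = BinaryHeap::new();
    for (i, run) in runs.iter_mut().enumerate() {
        if let Some(v) = next_value(run)? {
            heap.push(Reverse((v, i)));
        }
    }
    let mut sorted = Vec::new();
    while let Some(Reverse((v, i))) = heap.pop() {
        sorted.push(v);
        if let Some(next) = next_value(&mut runs[i])? {
            heap.push(Reverse((next, i)));
        }
    }

    // Close the run files before deleting them.
    drop(runs);
    for path in &run_paths {
        fs::remove_file(path)?;
    }
    Ok(sorted)
}

fn main() -> io::Result<()> {
    let dir = std::env::temp_dir().join(format!("spill-sketch-{}", std::process::id()));
    let sorted = external_sort(vec![5, 3, 9, 1, 8, 2, 7], 3, &dir)?;
    println!("{sorted:?}"); // prints [1, 2, 3, 5, 7, 8, 9]
    fs::remove_dir_all(&dir)?;
    Ok(())
}
```

During the merge phase the sketch keeps only one value per run in memory, which is why this pattern can handle inputs larger than the budget. The blog post would need to describe how DataFusion's actual operators account for memory and batch sizes, which this sketch does not attempt to model.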