Re: [I] Optimized spill file format [datafusion]

via GitHub Tue, 15 Apr 2025 02:36:55 -0700


2010YOUY01 commented on issue #14078:
URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2804443324


   https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/umami.pdf
   Here are some advanced German techniques: this paper discusses  
   1. The implementation of a hash-based spilling operator  
   2. Its spill file format  
   
   I think the spill-format optimizations are applicable here. Here is a TL;DR 
of the idea.
   
   ## Tiered encoding/compression
   
   The optimized spill format supports several encoding/compression schemes 
with different CPU/IO tradeoffs. At one end of the spectrum, encoding costs very 
little CPU but gives a worse compression ratio; at the other end, it gives a 
better compression ratio at a higher CPU cost. For example:
   
   ```text
   tier0 - Plain encoding (fast to encode, larger file size)
   tier1 - Plain encoding + LZ4 (medium encoding speed, medium compression ratio)
   tier2 - REE/Dictionary encoding + ZSTD (slow encoding speed, high compression ratio)
   ```
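
   The tiers above could be modeled as a small enum that is parsed from a 
configuration string. This is only a sketch: `SpillTier` and the strings 
`tier0`/`tier1`/`tier2` are hypothetical names taken from the list above, not 
an existing DataFusion type or option.

```rust
use std::str::FromStr;

/// Hypothetical tier names following the list above; not an existing
/// DataFusion type. Ordered from cheapest to most expensive to encode.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum SpillTier {
    /// Plain encoding: fast to encode, larger file size.
    Tier0,
    /// Plain encoding + LZ4: medium speed, medium compression ratio.
    Tier1,
    /// REE/dictionary encoding + ZSTD: slow to encode, high compression ratio.
    Tier2,
}

impl FromStr for SpillTier {
    type Err = String;

    /// Parses a (hypothetical) configuration value such as "tier1".
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        match s.trim().to_ascii_lowercase().as_str() {
            "tier0" => Ok(SpillTier::Tier0),
            "tier1" => Ok(SpillTier::Tier1),
            "tier2" => Ok(SpillTier::Tier2),
            other => Err(format!("unknown spill tier: {}", other)),
        }
    }
}
```

   Deriving `Ord` lets callers compare tiers by encoding cost, which may be 
handy if the operator later moves up or down one tier at a time.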
   
   ## Spilling operator integration
   
   The spill format configuration supports an `auto` mode in addition to the 
fixed tiers. In `auto` mode, the operator uses the default scheme for the first 
few batches, collects metrics (compute time, IO time), and then picks a tier:
   
   - When compute time >> IO time (e.g., running on a machine with fast SSDs)  
       - CPU-bound: choose light encoding scheme (tier0)
   - When compute time << IO time (e.g., running on a machine with spinning 
disks or with noisy neighbors that consume lots of IO bandwidth)  
       - IO-bound: choose heavier encoding scheme (tier2)
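
   The decision rule above can be sketched as follows. The names `SpillMetrics` 
and `choose_tier` are hypothetical, and the `ratio` threshold is an assumption 
standing in for ">>" / "<<"; the actual heuristic in the paper may differ.

```rust
use std::time::Duration;

/// Hypothetical tiers, as in the list earlier in this comment.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpillTier {
    Tier0,
    Tier1,
    Tier2,
}

/// Timings collected while spilling the first few batches (hypothetical).
#[derive(Debug, Default, Clone, Copy)]
pub struct SpillMetrics {
    pub compute_time: Duration,
    pub io_time: Duration,
}

/// Picks a tier from the sampled metrics.
///
/// `ratio` is an assumed threshold: one side must exceed the other by this
/// factor before we move away from `default`.
pub fn choose_tier(metrics: &SpillMetrics, default: SpillTier, ratio: f64) -> SpillTier {
    let compute = metrics.compute_time.as_secs_f64();
    let io = metrics.io_time.as_secs_f64();

    if compute > io * ratio {
        // CPU-bound: spend as little CPU on encoding as possible.
        SpillTier::Tier0
    } else if io > compute * ratio {
        // IO-bound: spend CPU to shrink the bytes written.
        SpillTier::Tier2
    } else {
        // Neither side dominates: keep the default scheme.
        default
    }
}
```

   For example, with `ratio = 4.0`, 900 ms of compute against 100 ms of IO 
selects `Tier0`, while the reverse selects `Tier2`.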
   
   The paper calls this `self-regulating compression`; with it, we don't have 
to set this option manually in typical cases.  
   
   The tricky part to implement is array encodings such as REE (run-end 
encoding) or bit-packing for integer arrays. Maybe we can reuse code from the 
Arrow Parquet writer implementation or use something like 
https://github.com/spiraldb/vortex. But it's okay to start without those 
encodings.
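
   To give a sense of what REE involves, here is a minimal standalone sketch 
for `i64` values. It stores cumulative run ends plus one value per run. The 
function names are hypothetical, and a real implementation would work on Arrow 
arrays and handle nulls, which this sketch does not.

```rust
/// Run-end encodes `input` into (run_ends, values).
/// `run_ends[i]` is the exclusive end index of run `i` in the original data.
pub fn ree_encode(input: &[i64]) -> (Vec<u32>, Vec<i64>) {
    let mut run_ends: Vec<u32> = Vec::new();
    let mut values: Vec<i64> = Vec::new();

    for (i, &v) in input.iter().enumerate() {
        let end = (i + 1) as u32;
        if values.last() == Some(&v) {
            // Same value as the current run: extend the run.
            if let Some(last_end) = run_ends.last_mut() {
                *last_end = end;
            }
        } else {
            // New value: start a new run that ends right after this element.
            values.push(v);
            run_ends.push(end);
        }
    }
    (run_ends, values)
}

/// Expands (run_ends, values) back into the original sequence.
pub fn ree_decode(run_ends: &[u32], values: &[i64]) -> Vec<i64> {
    let total = run_ends.last().copied().unwrap_or(0) as usize;
    let mut out = Vec::with_capacity(total);
    let mut start = 0u32;

    for (&end, &v) in run_ends.iter().zip(values) {
        // Each run covers the half-open range [start, end).
        out.extend(std::iter::repeat(v).take((end - start) as usize));
        start = end;
    }
    out
}
```

   Encoding `[7, 7, 7, 1, 1, 9]` yields run ends `[3, 5, 6]` and values 
`[7, 1, 9]`; the payoff depends entirely on how long the runs in the spilled 
data are.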
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


