2010YOUY01 commented on issue #14078: URL: https://github.com/apache/datafusion/issues/14078#issuecomment-2804443324
https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/umami.pdf

Here are some advanced German techniques: this paper discusses
1. The implementation of a hash-based spilling operator
2. Its spill file format

I think the spill-format-related optimization is applicable here. Here is a TLDR of the idea.

## Tiered encoding/compression

The optimized spill format supports several encoding/compression schemes with different CPU/IO tradeoffs. At one end of the spectrum, we get very little CPU overhead but a worse compression ratio; at the other end, we get a better compression ratio but higher CPU overhead. For example:

```text
tier0 - Plain encoding (fast to encode, larger file size)
tier1 - Plain encoding + LZ4 (medium encoding speed, medium compression ratio)
tier2 - REE/Dictionary encoding + ZSTD (slow encoding speed, high compression ratio)
```

## Spilling operator integration

The configuration for the spilling format supports an `auto` mode in addition to all available tiers. In `auto` mode, the operator uses the default scheme for the first couple of batches and collects related metrics (compute time, IO time):
- When compute time >> IO time (e.g., running on a machine with fast SSDs)
    - CPU-bound: choose a light encoding scheme (tier0)
- When compute time << IO time (e.g., running on a machine with spinning disks, or with noisy neighbors that consume lots of IO bandwidth)
    - IO-bound: choose a heavier encoding scheme (tier2)

The paper refers to this as `self-regulating compression`, and it means we don't have to set this option manually in typical cases.

The tricky part to implement is array encoding such as REE or bit-packing for integer arrays. Maybe we can find some reusable code in the Arrow Parquet writer implementation, or use something like https://github.com/spiraldb/vortex. But it's okay to start without those encodings.
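To make the `auto` mode decision above more concrete, here is a minimal Rust sketch of how the tier could be picked from the collected metrics. This is not existing DataFusion code and not the paper's algorithm: `SpillTier`, `SpillMetrics`, `choose_tier`, and the `DOMINANCE_RATIO` threshold are all hypothetical names and values made up for illustration. Falling back to tier1 when neither side clearly dominates is also an assumption, since the description above only covers the two extremes.

```rust
use std::time::Duration;

/// Hypothetical tiers mirroring the tier0/tier1/tier2 list above.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SpillTier {
    /// tier0: plain encoding
    Plain,
    /// tier1: plain encoding + LZ4
    PlainLz4,
    /// tier2: REE/dictionary encoding + ZSTD
    EncodedZstd,
}

/// Timings accumulated while spilling the first couple of batches.
#[derive(Debug, Default, Clone, Copy)]
pub struct SpillMetrics {
    pub compute_time: Duration,
    pub io_time: Duration,
}

impl SpillMetrics {
    /// Add the compute and IO time measured for one spilled batch.
    pub fn record(&mut self, compute: Duration, io: Duration) {
        self.compute_time += compute;
        self.io_time += io;
    }
}

/// Assumed threshold: one side must be at least this many times larger
/// than the other to count as ">>". The real value would need tuning.
const DOMINANCE_RATIO: f64 = 4.0;

/// Pick a tier from the sampled metrics.
pub fn choose_tier(metrics: &SpillMetrics) -> SpillTier {
    let compute = metrics.compute_time.as_secs_f64();
    let io = metrics.io_time.as_secs_f64();

    if compute > io * DOMINANCE_RATIO {
        // CPU-bound: spend as little CPU as possible on encoding.
        SpillTier::Plain
    } else if io > compute * DOMINANCE_RATIO {
        // IO-bound: spend CPU to shrink the bytes written.
        SpillTier::EncodedZstd
    } else {
        // Neither dominates: middle tier (an assumption, see above).
        SpillTier::PlainLz4
    }
}
```

An operator could call `record` after each of the first few batches and then call `choose_tier` once to fix the scheme for the rest of the spill, or re-evaluate periodically if IO conditions are expected to change.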