ding-young opened a new issue, #16367: URL: https://github.com/apache/datafusion/issues/16367
### Is your feature request related to a problem or challenge? Part of https://github.com/apache/datafusion/issues/16065 , Related to https://github.com/apache/datafusion/issues/14078 ### Background PR https://github.com/apache/datafusion/pull/16268 will introduce compression option when writing spill files in disk. These options are limited to compression options to what Arrow IPC Stream Writer supports : `zstd`, `lz4_frame` (compression level is not configurable) To further enhance spilling execution, we need to investigate - CPU / I/O tradeoff when `zstd` or `lz4_frame` compression is enabled i.e. compression ratio, extra latency spent for compression - Current arrow ipc stream writer always write `batch` at a time in `append_batch`. In terms of compression, it is not sure yet how much single batch can benefit from compression. - whether we need separate `Writer` or `Reader` implementation instead of IPC Stream Writer. - how to introduce sort of `adaptiveness`. ### Describe the solution you'd like First, we need to track (or update) how many bytes are written in spill files. Datafusion currently tracks `spilled_bytes` as part of `SpillMetrics`, but it is calculated based on in memory array size, which would be different from actual spill files size especially when we compress spill files. Second, update the benchmarks or write a separate benchmarks to see the performance characteristics. One possible way is writing out spill-related metrics to output.json when running benches like tpch with `debug` option. Another idea is to generate some spill files for microbenchmark testing only spill writing - reading process. ### Describe alternatives you've considered _No response_ ### Additional context _No response_ -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org For additional commands, e-mail: github-h...@datafusion.apache.org