pan3793 commented on PR #43338: URL: https://github.com/apache/spark/pull/43338#issuecomment-1760915017
> > > The RocksDB Team recommend LZ4 or ZSTD ...
> >
> > Why choose lz4 instead of zstd? I suppose zstd is a more future-proofing algorithm
>
> ZSTD has good compression ratio but is slower. LZ4 is the fast one with worse compression ratio (which is similar to Snappy). For Spark Structured Streaming, CPU is more of a bottleneck, rather than I/O or space, so LZ4 is a better choice.

Thanks for the explanation. I agree that lz4 consumes a little less CPU than zstd (even at level 1), but both the CPU cost and the compression ratio also depend on the data content.

> Can we make it configurable?

Strongly agree. Making it configurable lets users choose between high-speed and high-compression-ratio algorithms based on their hardware setup (e.g. we have some clusters with small SSDs on the compute nodes, and gzip/zstd compression is likely to be offloaded from the CPU to FPGAs or dedicated hardware in the future).
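
To illustrate the point that CPU cost and compression ratio depend on the data content, here is a minimal sketch. It is only an approximation: it uses Python's standard-library `zlib` at levels 1 and 9 as stand-ins for a "fast" and a "high-ratio" codec, not LZ4, ZSTD, or RocksDB itself, and both payloads are hypothetical.

```python
import random
import time
import zlib


def measure(data, level):
    """Return (compression ratio, elapsed seconds) for zlib at the given level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


rng = random.Random(0)  # fixed seed so the "random" payload is reproducible

# Hypothetical payloads: one highly repetitive, one effectively incompressible.
samples = {
    "repetitive": b"key=user42;state=active;" * 20_000,
    "random": bytes(rng.getrandbits(8) for _ in range(480_000)),
}

results = {}
for name, data in samples.items():
    # Level 1 favors speed, level 9 favors ratio (stand-ins only, see above).
    for level in (1, 9):
        ratio, seconds = measure(data, level)
        results[(name, level)] = ratio
        print(f"{name:10s} level={level} ratio={ratio:8.2f} time={seconds * 1000:8.2f} ms")
```

On the repetitive payload both levels compress well, while on the random payload neither level helps and the extra CPU time is wasted. That is the kind of workload-dependent trade-off a configurable codec would let users tune; real numbers for LZ4 and ZSTD would need to be measured on actual state store data.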
