pan3793 commented on PR #43338:
URL: https://github.com/apache/spark/pull/43338#issuecomment-1760915017

   > > > The RocksDB Team recommend LZ4 or ZSTD ...
   > > 
   > > 
   > > Why choose lz4 instead of zstd? I suppose zstd is a more future-proof
   > > algorithm
   > 
   > ZSTD has a good compression ratio but is slower. LZ4 is faster, with a
   > worse compression ratio (similar to Snappy's). For Spark Structured
   > Streaming, CPU is more of a bottleneck than I/O or space, so LZ4 is the
   > better choice.
   
   Thanks for the explanation. I agree that lz4 consumes a little less CPU
   than zstd (even at level 1), but both the CPU cost and the compression
   ratio also depend on the data content.
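
   To illustrate how strongly the data content affects both numbers, here is a
   minimal sketch. It uses Python's standard-library `zlib` as a stand-in,
   since neither lz4 nor zstd is assumed to be installed; the payloads and the
   `measure` helper are hypothetical and exist only for this example. Absolute
   timings will differ per machine, so only the relative trend matters.

```python
import os
import time
import zlib


def measure(data: bytes, level: int) -> tuple[float, float]:
    """Return (compression ratio, seconds spent) for zlib at the given level."""
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(data) / len(compressed), elapsed


# Two hypothetical payloads of the same size but very different content.
repetitive = b"key=value;" * 100_000       # highly compressible
random_like = os.urandom(len(repetitive))  # essentially incompressible

for name, payload in (("repetitive", repetitive), ("random", random_like)):
    for level in (1, 9):
        ratio, seconds = measure(payload, level)
        print(f"{name:<10} level={level} ratio={ratio:8.2f} "
              f"time={seconds * 1e3:7.2f} ms")
```

   The repetitive payload should compress by a large factor while the random
   one should not shrink at all, so any codec comparison is best repeated on
   data that resembles the real state store contents.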
   
   > Can we make it configurable?
   
   Strongly agree. Making it configurable lets users switch between
   high-speed and high-compression-ratio algorithms based on their hardware
   setup (e.g. we have some clusters with small SSDs on the compute nodes, and
   we are likely to offload the gzip/zstd codec from the CPU to FPGAs or
   dedicated hardware in the future).
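
   As a rough sketch of what "configurable" could look like: read the codec
   name from a config map, fall back to a default, and reject unknown names.
   Everything here is an assumption made for illustration: the key
   `example.stateStore.compression`, the supported list, and the `resolve_codec`
   helper are hypothetical and are not the names used in this PR. The `lz4`
   default only mirrors the choice discussed above.

```python
# Illustrative subset of codec names; not an authoritative list.
SUPPORTED_CODECS = ("lz4", "zstd", "snappy")
# Default mirrors the choice discussed in this thread (assumption).
DEFAULT_CODEC = "lz4"
# Hypothetical key, used only for this sketch; not taken from the PR.
CONF_KEY = "example.stateStore.compression"


def resolve_codec(conf: dict[str, str]) -> str:
    """Return the codec named in conf, falling back to the default."""
    name = conf.get(CONF_KEY, DEFAULT_CODEC).strip().lower()
    if name not in SUPPORTED_CODECS:
        raise ValueError(
            f"unsupported codec {name!r}; expected one of {SUPPORTED_CODECS}"
        )
    return name


print(resolve_codec({}))                  # no override: uses the default
print(resolve_codec({CONF_KEY: "ZSTD"}))  # override, case-insensitive
```

   Failing fast on an unknown name keeps a typo in the setting from silently
   falling back to a different codec than the one the user intended.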


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

