dongjoon-hyun commented on pull request #31618:
URL: https://github.com/apache/spark/pull/31618#issuecomment-784522469
Hi, @HyukjinKwon. Why do you think so?
> I think it's not an obvious win though .. Zstd looks more for archiving
> purpose with less throughput with high compression ratio vs lz4 is for more
> throughput with less compression.
According to the benchmark,
- LZ4 1.7.5 compression time is not a winner. If you consider the upload
time to the remote storage, ZSTD can be the winner.
- LZ4 1.7.5 decompression time might be your reason. However, this is an
event log.
- When you download a log from `Spark History Server`, a ZSTD log file will
be downloaded 2~3x faster.
- Also, when you view the log via `Spark History Server`, Spark History
Server itself downloads it from remote storage such as S3 and decompresses
it. A 2~3x faster download will compensate for the slower decompression.
In addition, for storage cost savings, ZSTD is a clear winner.
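
To make the download-versus-decompression trade-off concrete, here is a minimal back-of-the-envelope sketch. The function `fetch_time_s` and every number in it (log size, compression ratios, bandwidth, decompression throughput) are hypothetical assumptions chosen only for illustration; they are not taken from the benchmark.

```python
def fetch_time_s(raw_mb, ratio, bandwidth_mb_s, decompress_mb_s):
    """Seconds to download a compressed log and then decompress it.

    ratio is raw size / compressed size; decompress_mb_s is measured
    in MB of decompressed output per second.
    """
    compressed_mb = raw_mb / ratio
    download_s = compressed_mb / bandwidth_mb_s
    decompress_s = raw_mb / decompress_mb_s
    return download_s + decompress_s


# Hypothetical inputs: a 1000 MB raw event log fetched over a 50 MB/s link.
# The ZSTD file is assumed 2.5x smaller than the LZ4 file, and ZSTD is
# assumed to decompress at half the speed of LZ4.
lz4 = fetch_time_s(raw_mb=1000, ratio=2.0, bandwidth_mb_s=50, decompress_mb_s=2000)
zstd = fetch_time_s(raw_mb=1000, ratio=5.0, bandwidth_mb_s=50, decompress_mb_s=1000)

print(f"lz4: {lz4:.1f}s, zstd: {zstd:.1f}s")
```

Under these assumed numbers the transfer dominates the total, so the smaller file wins even with slower decompression. With a much faster link or a much smaller ratio gap, the result could flip.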