hudi-bot opened a new issue, #14938:
URL: https://github.com/apache/hudi/issues/14938
Currently, with Gzip as the default codec we prioritize compression/storage cost
at the expense of:
* Compute (on the **write path**): about **30%** of the compute burned during
bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below)
* Compute (on the **read path**), as well as query latency: queries
scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is
**3-4x** lower than Snappy's or Zstd's,
[example benchmark](https://stackoverflow.com/a/56410326/3520840))

P.S. Spark switched its default compression algorithm to Snappy [a while
ago](https://github.com/apache/spark/pull/12256).
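The write-path cost above is the familiar ratio-vs-CPU knob that every codec exposes. Snappy and Zstd are not in the JDK, so as a rough stand-in, a minimal sketch using `java.util.zip.Deflater` levels (level 1 playing the "fast" codec, level 9 the "dense" one; the payload is made-up sample text, not the Amazon Reviews data) illustrates the same tradeoff:

```java
import java.util.zip.Deflater;

public class CodecTradeoff {
    // Deflate `data` at the given compression level and return the compressed size.
    static int compressedSize(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    public static void main(String[] args) {
        // Stand-in payload; the benchmarks in this issue used the Amazon Reviews dataset.
        byte[] sample = "review text: great product, fast shipping. "
                .repeat(20_000).getBytes();
        int fast  = compressedSize(sample, Deflater.BEST_SPEED);       // cheap CPU, larger output
        int dense = compressedSize(sample, Deflater.BEST_COMPRESSION); // more CPU, smaller output
        System.out.println("level 1: " + fast + " bytes, level 9: " + dense + " bytes");
    }
}
```

Timing the two calls on a large input shows the CPU side of the same tradeoff; the point of this issue is that Gzip sits at the "dense" end by default while Snappy/Zstd give up little ratio for far more throughput.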
**EDIT**

We should actually evaluate putting in
[zstd](https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/)
instead of Snappy: it has compression ratios comparable to Gzip while
bringing in much better performance:

![image-2021-12-03-13-13-02-892.png](https://issues.apache.org/jira/secure/attachment/13036993/image-2021-12-03-13-13-02-892.png)

https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/
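Independently of the default, the codec is already user-configurable per write; a minimal sketch, assuming the `hoodie.parquet.compression.codec` write config from the Hudi configuration docs (zstd only takes effect where the underlying Parquet/Hadoop runtime supports it, per the later comments on this issue):

```properties
# Hudi writer config: override the Parquet compression codec (default is gzip)
hoodie.parquet.compression.codec=zstd
# or, for the Snappy alternative discussed above:
# hoodie.parquet.compression.codec=snappy
```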
## JIRA info
- Link: https://issues.apache.org/jira/browse/HUDI-2928
- Type: Improvement
- Epic: https://issues.apache.org/jira/browse/HUDI-3249
- Attachment(s):
  - 03/Dec/21 21:03 · alexey.kudinkin · [Screen Shot 2021-12-03 at 12.36.13 PM.png](https://issues.apache.org/jira/secure/attachment/13036992/Screen+Shot+2021-12-03+at+12.36.13+PM.png)
  - 06/Dec/21 19:49 · alexey.kudinkin · [Screen Shot 2021-12-06 at 11.49.05 AM.png](https://issues.apache.org/jira/secure/attachment/13037052/Screen+Shot+2021-12-06+at+11.49.05+AM.png)
  - 03/Dec/21 21:13 · alexey.kudinkin · [image-2021-12-03-13-13-02-892.png](https://issues.apache.org/jira/secure/attachment/13036993/image-2021-12-03-13-13-02-892.png)
---
## Comments
**03/Dec/21 21:03, alexey.kudinkin:**

![Screen Shot 2021-12-03 at 12.36.13 PM.png](https://issues.apache.org/jira/secure/attachment/13036992/Screen+Shot+2021-12-03+at+12.36.13+PM.png)
---
**06/Dec/21 19:50, alexey.kudinkin:**

Running a benchmark on a small subset of the Amazon Reviews dataset, we see a
considerable improvement in bulk-insert times: bulk-insert was up to **40%**
faster, while the storage footprint stayed very similar.

![Screen Shot 2021-12-06 at 11.49.05 AM.png](https://issues.apache.org/jira/secure/attachment/13037052/Screen+Shot+2021-12-06+at+11.49.05+AM.png)
---
**14/Dec/21 01:06, alexey.kudinkin:**

Unfortunately, switching to Zstd might require a little more grinding than
initially anticipated:

The current Parquet version (1.10.1, handed down by Spark 2.4.4) only
supports `ZstdCompressionCodec` as provided by `hadoop-common`, which in turn
requires Hadoop to be built with native-library support (including compression
codecs, etc.) and is only available on Linux/*nix.

Therefore, if we're planning on supporting Spark 2.x, we have the following
options:

1. Implement our own version of `ZstdCompressionCodec`, leveraging either
[zstd-jni](https://github.com/luben/zstd-jni) (used by Spark internally) or
airlift's aircompressor (which claims to be faster than the JNI implementation).
2. Make `zstd` the default setting only for Spark 3 environments.
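Option 2 amounts to gating the default on the Spark major version. A hypothetical sketch of that decision (the helper name and the wiring into Hudi's config are made up for illustration, not actual Hudi code):

```java
public class DefaultCodec {
    // Hypothetical helper: zstd needs Hadoop native libraries under Spark 2.x
    // (Parquet 1.10.1), so only make it the default on Spark 3+ and keep the
    // current gzip default elsewhere.
    static String forSparkVersion(String sparkVersion) {
        int major = Integer.parseInt(sparkVersion.split("\\.")[0]);
        return major >= 3 ? "zstd" : "gzip";
    }

    public static void main(String[] args) {
        System.out.println(forSparkVersion("2.4.4")); // gzip (keep current default)
        System.out.println(forSparkVersion("3.2.0")); // zstd
    }
}
```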
---
**12/Jan/22 00:45, alexey.kudinkin:**

Unfortunately, we won't be able to support Zstd without a herculean effort of
hacking around the Parquet implementation, as it's not modularized well enough
to support outside extensions.

The only sensible path at this point seems to be waiting for the Spark/Parquet
upgrade to 1.12.
---
**03/Feb/22 17:29, alexey.kudinkin:**

Uber's example of leveraging Zstd in lieu of Gzip:
https://eng.uber.com/cost-efficiency-big-data/
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]