[
https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17473227#comment-17473227
]
Alexey Kudinkin commented on HUDI-2928:
---------------------------------------
Unfortunately, we won't be able to support Zstd without a herculean effort of hacking
around the Parquet implementation, as it is not modularized well enough
to support outside extensions.
The only sensible way forward at this point seems to be waiting for the Spark/Parquet
upgrade to 1.12.
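
In the meantime, the codec can still be overridden per write rather than relying on
the Gzip default. A minimal sketch, assuming PySpark-style string options: the
helper name `hudi_write_options` and the table name "reviews" are hypothetical, and
only the two `hoodie.*` keys are Hudi configs.

```python
# Hypothetical helper for illustration; only the option keys are Hudi's own.
def hudi_write_options(table_name, codec="snappy"):
    """Build Hudi write options with an explicit Parquet compression codec."""
    return {
        "hoodie.table.name": table_name,
        # Overrides the default codec (Gzip at the time of this ticket).
        "hoodie.parquet.compression.codec": codec.lower(),
    }


options = hudi_write_options("reviews", codec="snappy")

# Assumed usage with a Spark session and the Hudi bundle on the classpath:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Snappy is used here only because Zstd is not an option before the upgrade; other
write options (record key, precombine field, etc.) are omitted.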
> Evaluate rebasing Hudi's default compression from Gzip to Zstd
> --------------------------------------------------------------
>
> Key: HUDI-2928
> URL: https://issues.apache.org/jira/browse/HUDI-2928
> Project: Apache Hudi
> Issue Type: Task
> Reporter: Alexey Kudinkin
> Assignee: Alexey Kudinkin
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.11.0
>
> Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot
> 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png
>
>
> Currently, by having Gzip as the default, we prioritize Compression/Storage cost at
> the expense of
> * Compute (on the {+}write-path{+}): about *30%* of Compute burned during
> bulk-insert in local benchmarks on the Amazon Reviews dataset is Gzip (see below)
> * Compute (on the {+}read-path{+}), as well as query latencies: queries
> scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput
> is *3-4x* lower than Snappy, Zstd,
> [EX|https://stackoverflow.com/a/56410326/3520840])
> P.S. Spark switched its default compression algorithm to Snappy [a while
> ago|https://github.com/apache/spark/pull/12256].
>
> *EDIT*
> We should actually evaluate putting in
> [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
> instead of Snappy. It has compression ratios comparable to Gzip, while
> bringing in much better performance:
> !image-2021-12-03-13-13-02-892.png!
> [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>
>
>