Alexey Kudinkin created HUDI-2928:
-------------------------------------
Summary: Evaluate rebasing Hudi's default compression from Gzip to
Snappy
Key: HUDI-2928
URL: https://issues.apache.org/jira/browse/HUDI-2928
Project: Apache Hudi
Issue Type: Task
Reporter: Alexey Kudinkin
Currently, with Gzip as the default, we prioritize compression ratio/storage
cost at the expense of:
* Compute (on the {+}write-path{+}): about *30%* of the compute burned during
bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below)
* Compute (on the {+}read-path{+}), as well as query latencies: queries
scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput is
*3-4x* lower than Snappy's, [EX|https://stackoverflow.com/a/56410326/3520840])
P.S. Spark switched its default Parquet compression codec to Snappy [a while
ago|https://github.com/apache/spark/pull/12256].
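For reference, the codec can already be overridden per-write today via Hudi's existing
{{hoodie.parquet.compression.codec}} write config, so users are not blocked on the default
changing. A minimal sketch of a Spark DataSource write (the {{df}}, table name, and
{{basePath}} below are hypothetical placeholders):

{code:scala}
// Override the Parquet compression codec for a single Hudi write
// (the current default is gzip; this opts a table into snappy)
df.write.format("hudi").
  option("hoodie.table.name", "amazon_reviews").
  option("hoodie.parquet.compression.codec", "snappy").
  mode("append").
  save(basePath)
{code}

Any evaluation here would presumably compare bulk-insert wall-clock time and resulting
file sizes under both codec settings on the same dataset.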
--
This message was sent by Atlassian Jira
(v8.20.1#820001)