[jira] [Comment Edited] (HUDI-2928) Evaluate rebasing Hudi's default compression from Gzip to Zstd

Alexey Kudinkin (Jira) Mon, 06 Dec 2021 11:58:06 -0800


    [ 
https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17454213#comment-17454213
 ]


Alexey Kudinkin edited comment on HUDI-2928 at 12/6/21, 7:57 PM:
-----------------------------------------------------------------

Running a benchmark on a small subset of the Amazon Reviews dataset, we see a 
considerable improvement in bulk-insert times: bulk-insert was up to 
*40%* faster, while the storage footprint stayed very similar.

!Screen Shot 2021-12-06 at 11.49.05 AM.png|width=935,height=644!
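
For anyone wanting to repeat the comparison, here is a minimal sketch of the write options that would switch the Parquet codec for a bulk-insert run. It assumes the codec is controlled through {{hoodie.parquet.compression.codec}}. The helper name {{hudi_write_options}} and the table name are hypothetical, and the option set is deliberately incomplete (record key, precombine field, etc. are omitted), so treat it as an illustration rather than the exact benchmark setup.

```python
def hudi_write_options(table_name, codec="zstd"):
    """Build a (partial) set of Hudi write options for a bulk-insert run.

    Hypothetical helper for illustration only: a real job also needs record
    key / precombine settings, which are left out here on purpose.
    """
    return {
        "hoodie.table.name": table_name,
        # Run the write as a bulk-insert, the operation benchmarked above.
        "hoodie.datasource.write.operation": "bulk_insert",
        # Codec used for the Parquet base files, e.g. "gzip", "snappy", "zstd".
        "hoodie.parquet.compression.codec": codec,
    }


if __name__ == "__main__":
    # Print the two configurations one would compare side by side.
    for codec in ("gzip", "zstd"):
        print(codec, hudi_write_options("reviews_bench", codec))

    # With a Spark session and a DataFrame `df` available (not set up here),
    # the options would be passed roughly like this:
    #   df.write.format("hudi").options(**opts).mode("append").save(base_path)
```

Running the same job once per codec and comparing wall-clock time and table size on storage should give numbers comparable in spirit to the screenshot above.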


was (Author: alexey.kudinkin):
Running a benchmark upon small subset of the Amazon Reviews dataset we're able 
to see considerable improvement in bulk-insert times: up to bulk-insert was up 
to *40%* faster, while it had very similar footprint in the storage.

!Screen Shot 2021-12-06 at 11.49.05 AM.png|width=935,height=644!

> Evaluate rebasing Hudi's default compression from Gzip to Zstd
> --------------------------------------------------------------
>
>                 Key: HUDI-2928
>                 URL: https://issues.apache.org/jira/browse/HUDI-2928
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 0.11.0
>
>         Attachments: Screen Shot 2021-12-03 at 12.36.13 PM.png, Screen Shot 
> 2021-12-06 at 11.49.05 AM.png, image-2021-12-03-13-13-02-892.png
>
>
> Currently, with Gzip as the default, we prioritize compression/storage cost at 
> the expense of
>  * Compute (on the {+}write-path{+}): about *30%* of the compute burned during 
> bulk-insert in local benchmarks on the Amazon Reviews dataset goes to Gzip (see below) 
>  * Compute (on the {+}read-path{+}), as well as query latencies: queries 
> scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput 
> is *3-4x* lower than that of Snappy or Zstd, 
> [EX|https://stackoverflow.com/a/56410326/3520840])
> P.S. Spark switched its default compression algorithm to Snappy [a while 
> ago|https://github.com/apache/spark/pull/12256].
>  
> *EDIT*
> We should actually evaluate putting in 
> [zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  instead of Snappy. It has compression ratios comparable to Gzip, while 
> bringing in much better performance:
> !image-2021-12-03-13-13-02-892.png!
> [https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
>  
>  
>  
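
The ratio-versus-throughput trade-off described in the issue can be checked locally with a small harness. The sketch below is only an assumption of how one might measure it and is not the benchmark used for the screenshots. It times Gzip-style compression through the standard {{zlib}} module on synthetic text, and adds Zstd only if the third-party {{zstandard}} package happens to be installed. The {{measure}} helper and the payload are hypothetical.

```python
import time
import zlib


def measure(compress, payload, repeats=3):
    """Return (compression_ratio, throughput_mb_per_s) for a compress callable.

    Takes the best of `repeats` runs to reduce timing noise.
    """
    best = float("inf")
    compressed = b""
    for _ in range(repeats):
        start = time.perf_counter()
        compressed = compress(payload)
        best = min(best, time.perf_counter() - start)
    best = max(best, 1e-9)  # guard against a zero reading from the timer
    ratio = len(payload) / len(compressed)
    throughput = len(payload) / (1024 * 1024) / best
    return ratio, throughput


if __name__ == "__main__":
    # Synthetic, highly repetitive payload; real review data compresses differently.
    payload = b"great product, would buy again. " * 50_000

    codecs = {"gzip-6": lambda data: zlib.compress(data, 6)}
    try:
        import zstandard  # third-party; optional

        codecs["zstd-3"] = zstandard.ZstdCompressor(level=3).compress
    except ImportError:
        pass  # Zstd is skipped when the package is unavailable

    for name, fn in codecs.items():
        ratio, mbps = measure(fn, payload)
        print(f"{name}: ratio={ratio:.1f}x, throughput={mbps:.0f} MB/s")
```

The absolute numbers depend heavily on the data and the machine, so only the relative difference between codecs on the same payload is meaningful.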



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
