[ 
https://issues.apache.org/jira/browse/HUDI-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexey Kudinkin reassigned HUDI-2928:
-------------------------------------

    Assignee: Alexey Kudinkin

> Evaluate rebasing Hudi's default compression from Gzip to Snappy
> ----------------------------------------------------------------
>
>                 Key: HUDI-2928
>                 URL: https://issues.apache.org/jira/browse/HUDI-2928
>             Project: Apache Hudi
>          Issue Type: Task
>            Reporter: Alexey Kudinkin
>            Assignee: Alexey Kudinkin
>            Priority: Major
>
> Currently, with Gzip as the default, we prioritize compression/storage cost at 
> the expense of
>  * Compute (on the {+}write-path{+}): about *30%* of the compute burned during 
> bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below) 
>  * Compute (on the {+}read-path{+}), as well as query latency: queries 
> scanning large datasets are likely to be compression-/CPU-bound (Gzip throughput 
> is *3-4x* lower than Snappy's, 
> [EX|https://stackoverflow.com/a/56410326/3520840])
>  
> P.S. Spark switched its default compression algorithm to Snappy [a while 
> ago|https://github.com/apache/spark/pull/12256].
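
Until the default changes, a writer can opt into Snappy explicitly. The following is a minimal sketch, assuming the base files are Parquet and that the codec is controlled by the `hoodie.parquet.compression.codec` write option. The helper name `hudi_write_options` and the table name are hypothetical and used only for illustration.

```python
# Hypothetical helper: only the two "hoodie.*" keys are Hudi options;
# the function name and the table name are made up for this sketch.
def hudi_write_options(table_name: str, codec: str = "snappy") -> dict:
    """Build Hudi write options with an explicit Parquet compression codec."""
    return {
        "hoodie.table.name": table_name,
        # Codec for Parquet base files; Gzip is the default discussed in this issue.
        "hoodie.parquet.compression.codec": codec,
    }


options = hudi_write_options("amazon_reviews")

# With PySpark and a Hudi bundle on the classpath (assumed, not shown here),
# the options would be passed to the writer roughly like this:
# df.write.format("hudi").options(**options).mode("append").save(base_path)
```

Setting the codec per write keeps existing tables readable, since Parquet records the codec in each file's metadata and files written with different codecs can coexist in one table.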



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
