hudi-bot opened a new issue, #14938:
URL: https://github.com/apache/hudi/issues/14938

   With Gzip as the default, we prioritize compression/storage cost 
at the expense of:
    * Compute (on the *write path*): about *30%* of the compute burned during 
bulk-insert in local benchmarks on the Amazon Reviews dataset is spent in Gzip (see below).
    * Compute (on the *read path*), as well as query latencies: queries 
scanning large datasets are likely to be compression-/CPU-bound, and Gzip's throughput is 
*3-4x* lower than that of Snappy or Zstd 
([example|https://stackoverflow.com/a/56410326/3520840]).
   
   P.S. Spark switched its default compression algorithm to Snappy [a while 
ago|https://github.com/apache/spark/pull/12256].
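
   Note that the Parquet codec is already overridable per table via Hudi's write 
config, so users who want Snappy (or another codec) today can set it themselves. 
A plain properties fragment using the existing `hoodie.parquet.compression.codec` 
option:

   ```properties
   # Override Hudi's Parquet compression codec for a table/write.
   # Accepted values include "gzip" (the current default discussed here)
   # and "snappy"; "zstd" additionally requires codec support in the
   # underlying Parquet/Hadoop stack.
   hoodie.parquet.compression.codec=snappy
   ```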
   *EDIT*
   
   We should actually evaluate adopting 
[Zstd|https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
 instead of Snappy: it offers compression ratios comparable to Gzip while 
delivering much better performance:
   
   !image-2021-12-03-13-13-02-892.png!
   
   
[https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/]
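
   For reproducibility, a minimal stdlib-only sketch of the kind of measurement 
behind the numbers above: time gzip compression of a repetitive payload and 
report throughput and compression ratio. A real comparison against Zstd/Snappy 
would need third-party bindings (e.g. the `zstandard` package), which are 
deliberately left out here; the payload and numbers are illustrative, not the 
benchmark results quoted in this issue.

   ```python
   import gzip
   import time

   def bench_gzip(data: bytes, level: int = 6) -> tuple[float, float]:
       """Return (elapsed_seconds, compressed_size / original_size)."""
       start = time.perf_counter()
       compressed = gzip.compress(data, compresslevel=level)
       elapsed = time.perf_counter() - start
       return elapsed, len(compressed) / len(data)

   # Repetitive text compresses well, loosely mimicking columnar data pages.
   payload = b"amazon-reviews sample row, rating=5, verified=true\n" * 20_000
   for level in (1, 6, 9):
       secs, ratio = bench_gzip(payload, level)
       mb_per_s = len(payload) / max(secs, 1e-9) / 1e6
       print(f"gzip level {level}: {mb_per_s:8.1f} MB/s, ratio {ratio:.4f}")
   ```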
   ## JIRA info
   
   - Link: https://issues.apache.org/jira/browse/HUDI-2928
   - Type: Improvement
   - Epic: https://issues.apache.org/jira/browse/HUDI-3249
   - Attachment(s):
     - 03/Dec/21 21:03, alexey.kudinkin: [Screen Shot 2021-12-03 at 12.36.13 PM.png](https://issues.apache.org/jira/secure/attachment/13036992/Screen+Shot+2021-12-03+at+12.36.13+PM.png)
     - 06/Dec/21 19:49, alexey.kudinkin: [Screen Shot 2021-12-06 at 11.49.05 AM.png](https://issues.apache.org/jira/secure/attachment/13037052/Screen+Shot+2021-12-06+at+11.49.05+AM.png)
     - 03/Dec/21 21:13, alexey.kudinkin: [image-2021-12-03-13-13-02-892.png](https://issues.apache.org/jira/secure/attachment/13036993/image-2021-12-03-13-13-02-892.png)
   
   
   ---
   
   
   ## Comments
   
   03/Dec/21 21:03, alexey.kudinkin:
   
   !Screen Shot 2021-12-03 at 12.36.13 PM.png!
   
   ---
   
   06/Dec/21 19:50, alexey.kudinkin:
   
   Running a benchmark on a small subset of the Amazon Reviews dataset, we see a 
considerable improvement in bulk-insert times: bulk-insert was up to *40%* 
faster, with a very similar storage footprint.
   
   !Screen Shot 2021-12-06 at 11.49.05 AM.png|width=935,height=644!
   
   ---
   
   14/Dec/21 01:06, alexey.kudinkin:
   
   Unfortunately, switching to Zstd might require a little more grinding than 
initially anticipated:
   
   The current Parquet version (1.10.1, handed down by Spark 2.4.4) only 
supports the `ZstdCompressionCodec` provided by "hadoop-common", which in turn 
requires Hadoop to be built with native-library support (including compression 
codecs, etc.) and is only available on Linux/*nix.
   
   Therefore, if we're planning on supporting Spark 2.x, we have the following 
options:
    # Implement our own version of `ZstdCompressionCodec`, leveraging either 
[zstd-jni|https://github.com/luben/zstd-jni] (used by Spark internally) or 
airlift's [aircompressor|https://github.com/airlift/aircompressor], which 
claims to be faster than the JNI implementation.
    # Make `zstd` the default only in Spark 3 environments.
   
   ---
   
   12/Jan/22 00:45, alexey.kudinkin:
   
   Unfortunately, we won't be able to support Zstd without a herculean effort of 
hacking around the Parquet implementation, as it is, regrettably, not 
modularized well enough to support outside extensions.
   
   The only sensible way forward at this point seems to be waiting for Spark to 
upgrade Parquet to 1.12.
   
   ---
   
   03/Feb/22 17:29, alexey.kudinkin:
   
   Uber's example of leveraging Zstd in lieu of Gzip:
   
   https://eng.uber.com/cost-efficiency-big-data/


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
