[GitHub] orc issue #301: ORC-395: Support ZSTD in C++ writer/reader

wgtmac Fri, 17 Aug 2018 11:53:26 -0700

Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/301
  
    To provide some benchmark results, I ran a few tests on my laptop using the 
TPC-H 1GB dataset. The C++ tools csv-import and orc-scan were used with their 
default configuration.
    
    **Writer CPU Time (unit: second)**
    
    name | zlib | zstd
    -- | -- | --
    customer | 1.976 | 0.777
    lineitem | 50.754 | 19.990
    nation | 0.002 | 0.003
    orders | 11.054 | 4.895
    part | 1.893 | 0.771
    partsupp | 8.791 | 3.512
    region | 0.002 | 0.002
    supplier | 0.130 | 0.056
    
    **Reader CPU Time (unit: second)**
    
    name | zlib | zstd
    -- | -- | --
    customer | 0.084 | 0.063
    lineitem | 2.263 | 2.094
    nation | 0.001 | 0.001
    orders | 0.454 | 0.340
    part | 0.071 | 0.061
    partsupp | 0.343 | 0.253
    region | 0.000 | 0.001
    supplier | 0.006 | 0.005
    
    **File Size (unit: byte)**
    
    name | zlib | zstd
    -- | -- | --
    customer | 7494965 | 7670751
    lineitem | 162544602 | 178904712
    nation | 1760 | 1882
    orders | 34599561 | 38028670
    part | 4273944 | 4676560
    partsupp | 25766380 | 29498151
    region | 1026 | 1097
    supplier | 474099 | 478017
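
To compare the two codecs in aggregate, the per-table figures above can be summed and the difference expressed relative to a chosen baseline. The following is a minimal sketch in Python: the numbers are copied from the three tables, while the names `totals` and `percent_change` are illustrative helpers and not part of ORC or its tools.

```python
# (zlib, zstd) pairs copied from the three tables above.
WRITER_CPU_S = {
    "customer": (1.976, 0.777), "lineitem": (50.754, 19.990),
    "nation": (0.002, 0.003), "orders": (11.054, 4.895),
    "part": (1.893, 0.771), "partsupp": (8.791, 3.512),
    "region": (0.002, 0.002), "supplier": (0.130, 0.056),
}
READER_CPU_S = {
    "customer": (0.084, 0.063), "lineitem": (2.263, 2.094),
    "nation": (0.001, 0.001), "orders": (0.454, 0.340),
    "part": (0.071, 0.061), "partsupp": (0.343, 0.253),
    "region": (0.000, 0.001), "supplier": (0.006, 0.005),
}
FILE_SIZE_B = {
    "customer": (7494965, 7670751), "lineitem": (162544602, 178904712),
    "nation": (1760, 1882), "orders": (34599561, 38028670),
    "part": (4273944, 4676560), "partsupp": (25766380, 29498151),
    "region": (1026, 1097), "supplier": (474099, 478017),
}


def totals(table):
    """Sum the zlib and zstd columns of one table."""
    zlib_total = sum(z for z, _ in table.values())
    zstd_total = sum(s for _, s in table.values())
    return zlib_total, zstd_total


def percent_change(baseline, value):
    """Signed change of `value` relative to `baseline`, in percent."""
    return (value - baseline) / baseline * 100.0


if __name__ == "__main__":
    tables = (("writer CPU", WRITER_CPU_S),
              ("reader CPU", READER_CPU_S),
              ("file size", FILE_SIZE_B))
    for label, table in tables:
        zlib_total, zstd_total = totals(table)
        print(f"{label}: zlib={zlib_total:.3f} zstd={zstd_total:.3f} "
              f"zstd vs zlib: {percent_change(zlib_total, zstd_total):+.1f}% "
              f"zlib vs zstd: {percent_change(zstd_total, zlib_total):+.1f}%")
```

The baseline matters when reading the percentages. With the writer totals from the table, ZLIB takes roughly 149% more CPU time than ZSTD, which is the same measurement as ZSTD taking roughly 60% less CPU time than ZLIB.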
    
    In total, ZLIB takes 148.6% more writer CPU time and about 14% more reader 
CPU time than ZSTD. Put the other way, ZSTD cuts writer time by about 60% and 
reader time by about 13%. Total file size is about 10% bigger for ZSTD. The 
result gives a basic idea of how the two compare. Because we use the default 
configuration (the ZLIB default level is -1, which zlib currently treats as 
level 6, and the ZSTD default is 3), the comparison may be unfair: ZSTD has 22 
levels while ZLIB has 9 in total. With different levels or different datasets, 
the result can vary a lot and ZSTD can beat ZLIB on file size. Overall, ZSTD 
seems to be a good compression option.
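
To illustrate how much the chosen level alone can move both CPU time and output size, here is a minimal sketch using Python's standard `zlib` module. It is not the ORC C++ writer and does not touch ZSTD; the synthetic pipe-delimited rows and the helper names `make_rows`, `measure`, and `compare_levels` are assumptions made for illustration only, so the absolute numbers it prints say nothing about TPC-H or ORC files.

```python
import random
import time
import zlib


def make_rows(n_rows, seed=0):
    """Build deterministic, pipe-delimited rows (hypothetical sample data)."""
    rng = random.Random(seed)
    words = ["quick", "final", "pending", "express", "regular", "ironic"]
    lines = []
    for i in range(n_rows):
        comment = " ".join(rng.choice(words) for _ in range(5))
        price = f"{rng.randint(1, 9999)}.{rng.randint(0, 99):02d}"
        lines.append(f"{i}|Customer#{i:09d}|{price}|{comment}")
    return "\n".join(lines).encode("ascii")


def measure(data, level):
    """Return (cpu_seconds, compressed_size) for one zlib level."""
    start = time.process_time()
    compressed = zlib.compress(data, level)
    elapsed = time.process_time() - start
    # Round-trip check, done outside the timed region.
    assert zlib.decompress(compressed) == data
    return elapsed, len(compressed)


def compare_levels(data, levels=(-1, 1, 6, 9)):
    """Measure each level; -1 is zlib's Z_DEFAULT_COMPRESSION."""
    return {level: measure(data, level) for level in levels}


if __name__ == "__main__":
    data = make_rows(50_000)
    print(f"input: {len(data)} bytes")
    for level, (seconds, size) in compare_levels(data).items():
        print(f"level {level:>2}: {seconds:.3f} s CPU, {size} bytes")
```

Running the same kind of sweep for both codecs on the real dataset would be needed before drawing conclusions about which one wins at comparable settings.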
