Github user wgtmac commented on the issue:
https://github.com/apache/orc/pull/301
To provide some benchmark results, I ran tests on my laptop using the
TPC-H 1 GB dataset and the C++ tools csv-import and orc-scan with their default
configuration.
**Writer CPU Time (unit: seconds)**
name | zlib | zstd
-- | -- | --
customer | 1.976 | 0.777
lineitem | 50.754 | 19.990
nation | 0.002 | 0.003
orders | 11.054 | 4.895
part | 1.893 | 0.771
partsupp | 8.791 | 3.512
region | 0.002 | 0.002
supplier | 0.130 | 0.056
**Reader CPU Time (unit: seconds)**
name | zlib | zstd
-- | -- | --
customer | 0.084 | 0.063
lineitem | 2.263 | 2.094
nation | 0.001 | 0.001
orders | 0.454 | 0.340
part | 0.071 | 0.061
partsupp | 0.343 | 0.253
region | 0.000 | 0.001
supplier | 0.006 | 0.005
**File Size (unit: bytes)**
name | zlib | zstd
-- | -- | --
customer | 7494965 | 7670751
lineitem | 162544602 | 178904712
nation | 1760 | 1882
orders | 34599561 | 38028670
part | 4273944 | 4676560
partsupp | 25766380 | 29498151
region | 1026 | 1097
supplier | 474099 | 478017
In total, ZLIB spends 148.6% more CPU time than ZSTD on writes (74.602 s vs.
30.006 s, a saving of about 59.8% for ZSTD) and about 14.3% more on reads
(3.222 s vs. 2.818 s, a saving of about 12.5%). ZSTD files are about 10.2%
bigger in total (259,259,840 vs. 235,156,337 bytes). These results give a basic
idea of how the two codecs compare. Because we use the default configuration
(ZLIB's default level is -1, which zlib maps to level 6, and ZSTD's is 3), the
comparison may be unfair: ZSTD has 22 levels while ZLIB has 9. With different
levels or different datasets the results can vary a lot, and ZSTD can beat ZLIB
on file size. Overall, ZSTD seems to be a good compression option.
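For anyone who wants to re-derive the totals from the tables above, here is a
minimal sketch in Python. It is not part of the benchmark itself: the
dictionary and function names (`WRITER_SECONDS`, `totals`, `percent_more`,
`percent_saved`) are illustrative, and the values are simply copied from the
three tables.

```python
# Values copied from the tables above: table name -> (zlib, zstd).
WRITER_SECONDS = {
    "customer": (1.976, 0.777), "lineitem": (50.754, 19.990),
    "nation": (0.002, 0.003), "orders": (11.054, 4.895),
    "part": (1.893, 0.771), "partsupp": (8.791, 3.512),
    "region": (0.002, 0.002), "supplier": (0.130, 0.056),
}
READER_SECONDS = {
    "customer": (0.084, 0.063), "lineitem": (2.263, 2.094),
    "nation": (0.001, 0.001), "orders": (0.454, 0.340),
    "part": (0.071, 0.061), "partsupp": (0.343, 0.253),
    "region": (0.000, 0.001), "supplier": (0.006, 0.005),
}
FILE_BYTES = {
    "customer": (7494965, 7670751), "lineitem": (162544602, 178904712),
    "nation": (1760, 1882), "orders": (34599561, 38028670),
    "part": (4273944, 4676560), "partsupp": (25766380, 29498151),
    "region": (1026, 1097), "supplier": (474099, 478017),
}


def totals(table):
    """Return (zlib_total, zstd_total) summed over all TPC-H tables."""
    zlib_total = sum(zlib for zlib, _ in table.values())
    zstd_total = sum(zstd for _, zstd in table.values())
    return zlib_total, zstd_total


def percent_more(larger, smaller):
    """How much bigger `larger` is than `smaller`, in percent of `smaller`."""
    return (larger / smaller - 1.0) * 100.0


def percent_saved(baseline, new):
    """How much `new` saves relative to `baseline`, in percent of `baseline`."""
    return (1.0 - new / baseline) * 100.0


if __name__ == "__main__":
    for label, table in (("writer", WRITER_SECONDS), ("reader", READER_SECONDS)):
        zlib_total, zstd_total = totals(table)
        print(f"{label}: zlib takes {percent_more(zlib_total, zstd_total):.1f}% more, "
              f"zstd saves {percent_saved(zlib_total, zstd_total):.1f}%")
    zlib_size, zstd_size = totals(FILE_BYTES)
    print(f"size: zstd is {percent_more(zstd_size, zlib_size):.1f}% bigger")
```

Note that "X% more time for ZLIB" and "X% saving for ZSTD" use different
denominators, so the two figures differ; a saving can never exceed 100%.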
---