Github user wgtmac commented on the issue:

https://github.com/apache/orc/pull/301

To provide some benchmark results, I ran some tests on my laptop using the TPC-H 1 GB dataset. The C++ tools csv-import and orc-scan were used with their default configuration.

**Writer CPU Time (unit: second)**

name | zlib | zstd
-- | -- | --
customer | 1.976 | 0.777
lineitem | 50.754 | 19.990
nation | 0.002 | 0.003
orders | 11.054 | 4.895
part | 1.893 | 0.771
partsupp | 8.791 | 3.512
region | 0.002 | 0.002
supplier | 0.130 | 0.056

**Reader CPU Time (unit: second)**

name | zlib | zstd
-- | -- | --
customer | 0.084 | 0.063
lineitem | 2.263 | 2.094
nation | 0.001 | 0.001
orders | 0.454 | 0.340
part | 0.071 | 0.061
partsupp | 0.343 | 0.253
region | 0.000 | 0.001
supplier | 0.006 | 0.005

**File Size (unit: byte)**

name | zlib | zstd
-- | -- | --
customer | 7494965 | 7670751
lineitem | 162544602 | 178904712
nation | 1760 | 1882
orders | 34599561 | 38028670
part | 4273944 | 4676560
partsupp | 25766380 | 29498151
region | 1026 | 1097
supplier | 474099 | 478017

In total, ZLIB writer CPU time is 148.6% higher than ZSTD's (74.602 s vs. 30.006 s), so ZSTD saves about 60% of the writer time. ZLIB reader CPU time is about 14% higher (3.222 s vs. 2.818 s), so ZSTD saves about 12.5% of the reader time. The ZSTD files are about 10% bigger in total (259259840 vs. 235156337 bytes).

These results give a basic idea of how the two codecs compare. Because we used the default configuration (the ZLIB default level is -1, which zlib currently treats as level 6, and the ZSTD default level is 3), the comparison may be unfair: ZSTD has 22 levels while ZLIB has 9. With different levels or different datasets the results can vary a lot, and ZSTD can beat ZLIB on file size. Overall, ZSTD seems to be a good compression option.
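
For anyone who wants to re-derive the totals, here is a minimal sketch that sums the per-table figures above and prints the relative differences. The helper names (`totals`, `percent_more`, `percent_saved`) are made up for this example and are not part of the ORC tools. The numbers are copied from the tables and nothing is re-measured.

```python
# Per-table measurements copied from the benchmark tables: (zlib, zstd).
writer_cpu_s = {
    "customer": (1.976, 0.777), "lineitem": (50.754, 19.990),
    "nation": (0.002, 0.003), "orders": (11.054, 4.895),
    "part": (1.893, 0.771), "partsupp": (8.791, 3.512),
    "region": (0.002, 0.002), "supplier": (0.130, 0.056),
}
reader_cpu_s = {
    "customer": (0.084, 0.063), "lineitem": (2.263, 2.094),
    "nation": (0.001, 0.001), "orders": (0.454, 0.340),
    "part": (0.071, 0.061), "partsupp": (0.343, 0.253),
    "region": (0.000, 0.001), "supplier": (0.006, 0.005),
}
file_size_bytes = {
    "customer": (7494965, 7670751), "lineitem": (162544602, 178904712),
    "nation": (1760, 1882), "orders": (34599561, 38028670),
    "part": (4273944, 4676560), "partsupp": (25766380, 29498151),
    "region": (1026, 1097), "supplier": (474099, 478017),
}


def totals(table):
    """Return (zlib_total, zstd_total) summed over all rows of one table."""
    zlib_total = sum(zlib_value for zlib_value, _ in table.values())
    zstd_total = sum(zstd_value for _, zstd_value in table.values())
    return zlib_total, zstd_total


def percent_more(larger, smaller):
    """How much `larger` exceeds `smaller`, as a percentage of `smaller`."""
    return (larger - smaller) / smaller * 100.0


def percent_saved(baseline, improved):
    """How much `improved` saves, as a percentage of `baseline`."""
    return (baseline - improved) / baseline * 100.0


if __name__ == "__main__":
    for label, table in (("writer CPU", writer_cpu_s),
                         ("reader CPU", reader_cpu_s)):
        zlib_total, zstd_total = totals(table)
        print(f"{label}: zlib={zlib_total:.3f}s zstd={zstd_total:.3f}s, "
              f"zlib is {percent_more(zlib_total, zstd_total):.1f}% higher, "
              f"zstd saves {percent_saved(zlib_total, zstd_total):.1f}%")

    zlib_size, zstd_size = totals(file_size_bytes)
    print(f"file size: zlib={zlib_size} zstd={zstd_size}, "
          f"zstd is {percent_more(zstd_size, zlib_size):.1f}% bigger")
```

Keeping "percent more" and "percent saved" as separate helpers makes it explicit which total is the denominator, which is easy to mix up when a ratio is larger than 2x.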
---