Hi,

We are using zstd as the default compressor for ORC in production. Overall the performance is very good. However, our analysis shows that there is still some room for improvement for integers.

As we know, all integers use base-128 varint encoding (a.k.a. LEB128) after RLE. This works well for zlib and other compressors. However, with zstd, LEB128-encoded data compresses worse than fixed 64-bit int64_t values. I have opened an issue in the zstd community and this has been confirmed there: https://github.com/facebook/zstd/issues/1325.

To provide some data, we have an ORC file with 10 columns (4 long columns and 6 string columns). None of the 4 long columns is a good fit for RLE, meaning that most of their values end up as literals in the RLE output. The overall file sizes for the different settings are:

- RLEv1 + LEB128: 8991617 bytes
- RLEv2 + LEB128: 8305585 bytes
- RLEv1 + fixed 64-bit: 7961360 bytes

I also analyzed one column of the file on its own and got the following results:

- RLEv1 + zstd + LEB128: 1188651 bytes
- RLEv1 + zstd + fixed 64-bit: 685522 bytes
- RLEv1 + zlib + LEB128: 834729 bytes
- RLEv1 + zlib + fixed 64-bit: 854529 bytes

From the above observations, we find that it is better to disable LEB128 encoding when zstd is used. This can easily be achieved by bumping the file version.

Any thoughts?

Thanks!
Gang
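
P.S. In case it helps the discussion, here is a minimal Python sketch of the two integer layouts compared above. It is only an illustration, not ORC's actual writer code: the function names are hypothetical, it handles unsigned values only (no zigzag step for signed integers), and the little-endian byte order for the fixed 64-bit form is an assumption.

```python
import struct


def leb128_encode(value: int) -> bytes:
    """Encode a non-negative integer as unsigned LEB128 (base-128 varint)."""
    if value < 0:
        raise ValueError("unsigned LEB128 requires a non-negative value")
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)         # high bit clear: last byte
            return bytes(out)


def fixed64_encode(value: int) -> bytes:
    """Encode as 8 bytes; little-endian order is an assumption here."""
    return struct.pack("<Q", value)


def encode_all(values, encoder) -> bytes:
    """Concatenate the encodings of all values, as a literal run would."""
    return b"".join(encoder(v) for v in values)
```

With LEB128, each value occupies between 1 and 10 bytes depending on its magnitude, so the byte stream handed to the compressor has no fixed stride. With the fixed 64-bit layout, every value starts on an 8-byte boundary. To experiment, one could feed the output of `encode_all` for both encoders into zstd and zlib and compare the compressed sizes.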