Gang,

As you correctly point out, some columns don't work well with RLE. Unfortunately, without being able to look at the data, it is hard for me to guess what the right compression strategies are. Based on your description, I would guess that the data doesn't have many patterns and covers the majority of the 64-bit integer space.

I think the best approach would be to make sure that RLEv3 has a low-overhead representation of literals. So a literal mode something like:
header: 2 bytes (literal, 512 values, size 64bit)
data: 512 * 8 bytes

So the overhead would be roughly 2/4096, or about 0.0005 (0.05%).

Thoughts?

On Tue, Sep 18, 2018 at 3:38 PM Gopal Vijayaraghavan <gop...@apache.org> wrote:

> Hi,
>
> > From above observation, we find that it is better to disable LEB128
> > encoding while zstd is used.
>
> You can enable file size optimizations (automatically recommend better
> layouts for compression) when
>
> "orc.encoding.strategy"="COMPRESSION"
>
> There are a bunch of bitpacking loops that's controlled by that flag
> already.
>
> > https://github.com/facebook/zstd/issues/1325.
>
> If I understand that correctly, a DIRECT_V2 would also work fine for the
> numeric sequences in Zstd instead?
>
> Cheers,
> Gopal
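
To make the literal mode proposed above concrete, here is a minimal Python sketch of a 2-byte header followed by 512 raw 64-bit values. The bit layout of the header (2 bits of mode, 9 bits of run length, 5 bits of width) and the names encode_literal_run, decode_literal_run, and overhead are assumptions for illustration only; nothing in this thread or in the ORC specification fixes them.

```python
import struct

# Hypothetical 16-bit header layout (an assumption, not an ORC format):
#   bits 15-14: mode (0b00 = literal)
#   bits 13-5 : run length minus one (9 bits, so up to 512 values)
#   bits 4-0  : value width in bytes minus one (7 means 64-bit values)
MODE_LITERAL = 0
MAX_RUN = 512
WIDTH_BYTES = 8


def encode_literal_run(values):
    """Pack up to 512 signed 64-bit integers behind a 2-byte header."""
    if not 1 <= len(values) <= MAX_RUN:
        raise ValueError("a literal run holds between 1 and 512 values")
    header = (MODE_LITERAL << 14) | ((len(values) - 1) << 5) | (WIDTH_BYTES - 1)
    # Big-endian: one unsigned 16-bit header, then the raw 64-bit values.
    return struct.pack(">H", header) + struct.pack(f">{len(values)}q", *values)


def decode_literal_run(buf):
    """Inverse of encode_literal_run for a single run."""
    (header,) = struct.unpack_from(">H", buf, 0)
    mode = header >> 14
    count = ((header >> 5) & 0x1FF) + 1
    width = (header & 0x1F) + 1
    if mode != MODE_LITERAL or width != WIDTH_BYTES:
        raise ValueError("not a 64-bit literal run")
    return list(struct.unpack_from(f">{count}q", buf, 2))


def overhead(count=MAX_RUN, width=WIDTH_BYTES, header_bytes=2):
    """Header bytes as a fraction of the payload bytes."""
    return header_bytes / (count * width)
```

A full run of 512 values encodes to 2 + 4096 = 4098 bytes, and overhead() returns 2/4096, which matches the estimate in the message. Shorter runs pay the same 2 bytes over less data, so the overhead fraction rises as the run gets shorter.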