I looked into this a while ago. If I remember correctly, I concluded that Horizontal Bit-Parallel (HBP) might be helpful, but that the vertical option was probably not appropriate.
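To make the HBP idea concrete, here is a minimal sketch in Java of a word-at-a-time less-than predicate. It is only an illustration under assumptions, not Parquet's encoding and not the paper's exact layout: the names (HbpSketch, pack, lessThan) are made up, codes are assumed to be 3 bits wide, and values are packed in simple linear order rather than in the reordered layout the paper describes. Each code sits in a 4-bit field whose top bit is a padding bit left at 0, and the comparison result for each code ends up in that padding bit.

```java
// Illustrative sketch only: hypothetical names, 3-bit codes, linear packing order.
final class HbpSketch {
    static final int CODE_BITS = 3;               // width of each code
    static final int FIELD_BITS = CODE_BITS + 1;  // code plus one padding bit
    static final int FIELDS = 64 / FIELD_BITS;    // 16 codes per 64-bit word

    // Repeats a FIELD_BITS-wide pattern in every field of a 64-bit word.
    static long repeat(long pattern) {
        long word = 0L;
        for (int i = 0; i < FIELDS; i++) {
            word |= pattern << (i * FIELD_BITS);
        }
        return word;
    }

    static final long LOW_MASK = repeat((1L << CODE_BITS) - 1); // 0111 in each field
    static final long PAD_MASK = repeat(1L << CODE_BITS);       // 1000 in each field

    // Packs up to FIELDS codes, code i into field i; padding bits stay 0.
    static long pack(int[] codes) {
        if (codes.length > FIELDS) {
            throw new IllegalArgumentException("too many codes for one word");
        }
        long word = 0L;
        for (int i = 0; i < codes.length; i++) {
            if (codes[i] < 0 || codes[i] >= (1 << CODE_BITS)) {
                throw new IllegalArgumentException("code out of range: " + codes[i]);
            }
            word |= ((long) codes[i]) << (i * FIELD_BITS);
        }
        return word;
    }

    // Sets the padding bit of each field iff that field's code < constant.
    // Per field: constant + (2^k - 1 - code) reaches 2^k exactly when code < constant,
    // and the sum never carries into the neighboring field.
    // Unused fields hold code 0, so they also compare as 0 < constant.
    static long lessThan(long packed, int constant) {
        if (constant < 0 || constant >= (1 << CODE_BITS)) {
            throw new IllegalArgumentException("constant out of range: " + constant);
        }
        long c = repeat(constant);
        return (c + (packed ^ LOW_MASK)) & PAD_MASK;
    }
}
```

With this layout, one XOR, one add, and one AND on a plain long evaluate the predicate for all 16 codes in the word, and no SIMD instructions are involved.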
HBP would allow Parquet readers to run predicates on multiple values at once without needing SIMD instructions, which aren't available to JVM processes. (With SIMD instructions, you get even more value.) That would be useful, but I think we'd have to change the bit packing encoding to lay out values with the extra padding bit where predicate evaluation results end up, because the work to reorder and pack values is only worth it if the packed layout is reused.

For Vertical Bit-Parallel (VBP), I think the reason I decided it wouldn't be useful for Parquet is that it is really expensive to produce, and really expensive to reconstruct the values that aren't filtered out. When reconstructing more than just a few rows, as you would for large scans, it would be much more expensive.

On Sun, Oct 14, 2018 at 1:26 PM Jim Apple <jbap...@apache.org> wrote:

> On 2018/10/08 22:08:16, Julien Le Dem <julien.le...@wework.com.INVALID>
> wrote:
> > it's a variation of bit packing. right?
>
> I looked into it on
> https://github.com/apache/parquet-format/blob/master/Encodings.md and I
> believe that the Horizontal Bit-Parallel encoding in the paper is a variant
> on bit packing. There are three changes:
>
> 1. No code is split between words
> 2. Every code gets a padding bit
> 3. The order of the packing is not linear; code 1 is not packed in a word
> with code 2.
>
> The paper obviously has much more detail. :-)
>
> The various vertical encodings referenced in the paper (bit-slicing,
> vertical bit-parallel, and BitWeaving/V) look further afield from Parquet's
> bit packing.
>

-- 
Ryan Blue
Software Engineer
Netflix