Hi everyone,

In the Parquet sync-up, I mentioned that I've been testing out some new
encodings lately and it would be good to get feedback on the results. I've
posted the code I've been working with in two branches:

  Parquet MR: https://github.com/rdblue/parquet-mr/commits/encoders
  Parquet Format: https://github.com/rdblue/parquet-format/tree/
zig-zag-encodings

The motivation is that the RLE encoding is simple, easy to understand, and
performs well, but it can't handle negatives very well and has a fixed
bit-packing width. The "new" encoding that this test uses is RLE with an
extra byte for bit-packed runs that encodes the bit width used to solve the
fixed-width limitation. Then using that encoder (and a variant of it for
longs) I added writers that will zig-zag encode values (move the sign bit
to the lsb so that negatives aren't all 32 bits), and store the delta
between values (also zig-zag encoded) to handle negative deltas.

I've also added a couple of test floating point encodings that separate the
exponent bits from the significand bits and then uses Zig-zag RLE on both
sequences. There's also a variant that will xor significand values to
attempt a diff-encoding like the integer deltas.

The results are pretty good so far. The zig-zag delta encoding for integers
seems to beat the existing delta encoding for the dataset I'm testing with
in all cases. One column is 7.21 B/val with delta, and 6.65 B/val with
delta-zig-zag; another column with smaller values is 1.44 B/val (delta) vs
1.26 B/val (delta-zig-zag). RLE works great if values have runs just like
the current delta encoding.

Floating point compression is good as well. Columns that were not
dictionary-encoded in my data are now 6.84 B/val instead of 7.43 B/val with
plain encoding and gzip compression. The 6.84 B/val is the xor encoding
with gzip compression. Xor encoding also does well with columns that would
be dictionary-encoded: not usually as small as the dictionary-encoding, but
within a byte. In one case the Xor encoding was smaller than dictionary
encoding.

These appear to do fairly well, so it would be great if anyone else has
time to try them out. I haven't benchmarked the encoding/decoding time yet.

rb

-- 
Ryan Blue
Software Engineer
Netflix

Reply via email to