Hi everyone, In the Parquet sync-up, I mentioned that I've been testing out some new encodings lately and it would be good to get feedback on the results. I've posted the code I've been working with in two branches:
Parquet MR: https://github.com/rdblue/parquet-mr/commits/encoders Parquet Format: https://github.com/rdblue/parquet-format/tree/ zig-zag-encodings The motivation is that the RLE encoding is simple, easy to understand, and performs well, but it can't handle negatives very well and has a fixed bit-packing width. The "new" encoding that this test uses is RLE with an extra byte for bit-packed runs that encodes the bit width used to solve the fixed-width limitation. Then using that encoder (and a variant of it for longs) I added writers that will zig-zag encode values (move the sign bit to the lsb so that negatives aren't all 32 bits), and store the delta between values (also zig-zag encoded) to handle negative deltas. I've also added a couple of test floating point encodings that separate the exponent bits from the significand bits and then uses Zig-zag RLE on both sequences. There's also a variant that will xor significand values to attempt a diff-encoding like the integer deltas. The results are pretty good so far. The zig-zag delta encoding for integers seems to beat the existing delta encoding for the dataset I'm testing with in all cases. One column is 7.21 B/val with delta, and 6.65 B/val with delta-zig-zag; another column with smaller values is 1.44 B/val (delta) vs 1.26 B/val (delta-zig-zag). RLE works great if values have runs just like the current delta encoding. Floating point compression is good as well. Columns that were not dictionary-encoded in my data are now 6.84 B/val instead of 7.43 B/val with plain encoding and gzip compression. The 6.84 B/val is the xor encoding with gzip compression. Xor encoding also does well with columns that would be dictionary-encoded: not usually as small as the dictionary-encoding, but within a byte. In one case the Xor encoding was smaller than dictionary encoding. These appear to do fairly well, so it would be great if anyone else has time to try them out. I haven't benchmarked the encoding/decoding time yet. rb -- Ryan Blue Software Engineer Netflix
