Hey folks,

I'm working on KUDU-1600, which has the goal of ignoring redundant
compression options on encodings that are already inherently compressed.
For example, when using BITSHUFFLE encoding, lz4 compression is already
done by the bitshuffle library, and a second pass of  LZ4 or SNAPPY
compression is just a waste of CPU.

The most straightforward solution here would be to just disable it at table
creation (or alter) time. However, the case of DICT_ENCODING is a bit more
interesting: DICT_ENCODING operates per-block in two different modes: in
the first mode, it dictionary-encodes the strings and uses bitshuffle to
encode the code integers. In that case, lz4 on top is redundant. In the
second mode, if the dictionary ends up being too-high cardinality, it falls
back to PLAIN blocks, in which case lz4 is likely a good idea. This
"encoding switching" is done inside the DICT_ENCODING block format: i.e the
cfile code always goes to the dict block decoder, which internally falls
back to using the PLAIN decoder if necessary.

Note that, within a single cfile, it's possible that some blocks have done
"fallback" whereas others are dict-encoded.

Given the above, it's not a good idea to have a blanket rule "LZ4 +
DICT_ENCODING is disallowed". Rather, we'd like to allow LZ4 + DICT, but
only compress the DICT blocks when the fallback code is activated. This
implies that we need two new bits of code:

1) after encoding a block, we need to call something like
block_encoder_->ShouldCompress(compression_codec). The encoder can then
decide whether said compression is redundant on a per-block basis.
2) when writing each block, then, we need to store whether or not it was
compressed. This is where things get tricky, since the format doesn't
currently provide for such a per-block "is_compressed' flag.

So, this brings us to the subject of this email: we need to somehow evolve
the cfile format to add such capabiltiies.

Currently, if compression is enabled in a cfile, then every block has an
8-byte prefix <uint32 uncompressed_len> <uint32 compressed_len>. We could
use a special value like '0xffffffff 0xffffffff'' to indicate a "skipped
compression" block, but it feels a bit like a hack, and wastes 8 bytes.
Another option would be to use a shorter code like a single 0xffffffff to
save 4 bytes.

Obviously these options would break backwards-compatibility of the cfile
format: the old reader would see these values and be confused, generating
some kind of error about a corrupt cfile. That's not ideal, but not sure I
see a better way around it.

Another thing we could do to generate a slightly less weird error would be
to bump the CFile version number to indicate an incompatible change, and
then use a simple scheme like a single byte "is_compressed" at the start of
each block. However, it turns out that our reader code doesn't currently
check the major/minor versions in the CFile header (woops!) so we'd
actually need to bump the 'magic' string at the start of the file.

This, too, would be backwards-incompatible and prevent downgrade from a new
version back to an old one.

Any thoughts on the best way to address this issue?

-Todd
-- 
Todd Lipcon
Software Engineer, Cloudera

Reply via email to