Naively, I'd vote for the approach that changes the magic string and (of course) adds version checking to the reader code, so that no further magic string change is necessary in the future. I think that's the most predictable approach (i.e., someone less familiar with the code would easily understand the motivation).
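To make that suggestion concrete, here's a rough sketch of what reader-side version checking could look like. The struct layout, constant names, and version numbers below are illustrative assumptions, not Kudu's actual code:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical sketch of a CFile header check. Magic value, field
// layout, and supported-version constant are assumptions for
// illustration only.
namespace sketch {

constexpr char kMagic[] = "kuducfil";       // assumed 8-byte magic string
constexpr uint32_t kMaxSupportedMajor = 2;  // highest major this reader knows

struct CFileHeader {
  char magic[8];
  uint32_t major_version;
  uint32_t minor_version;
};

enum class Status { kOk, kBadMagic, kIncompatibleVersion };

// Reject files whose magic doesn't match, or whose major version is
// newer than this reader understands. With this check in place, a
// future incompatible change (e.g. a per-block is_compressed flag)
// can ride on a major-version bump instead of a new magic string.
Status CheckHeader(const CFileHeader& hdr) {
  if (std::memcmp(hdr.magic, kMagic, sizeof(hdr.magic)) != 0) {
    return Status::kBadMagic;
  }
  if (hdr.major_version > kMaxSupportedMajor) {
    return Status::kIncompatibleVersion;
  }
  return Status::kOk;
}

}  // namespace sketch
```

An old reader with this check would then report "file version too new" rather than a confusing corruption error.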
On Wed, Nov 30, 2016 at 10:31 AM, Todd Lipcon <[email protected]> wrote:
> Hey folks,
>
> I'm working on KUDU-1600, which has the goal of ignoring redundant
> compression options on encodings that are already inherently compressed.
> For example, when using BITSHUFFLE encoding, LZ4 compression is already
> done by the bitshuffle library, and a second pass of LZ4 or SNAPPY
> compression is just a waste of CPU.
>
> The most straightforward solution here would be to just disable it at table
> creation (or alter) time. However, the case of DICT_ENCODING is a bit more
> interesting: DICT_ENCODING operates per-block in two different modes. In
> the first mode, it dictionary-encodes the strings and uses bitshuffle to
> encode the code integers. In that case, LZ4 on top is redundant. In the
> second mode, if the dictionary ends up being too high cardinality, it falls
> back to PLAIN blocks, in which case LZ4 is likely a good idea. This
> "encoding switching" is done inside the DICT_ENCODING block format: i.e. the
> cfile code always goes to the dict block decoder, which internally falls
> back to using the PLAIN decoder if necessary.
>
> Note that, within a single cfile, it's possible that some blocks have done
> "fallback" whereas others are dict-encoded.
>
> Given the above, it's not a good idea to have a blanket rule "LZ4 +
> DICT_ENCODING is disallowed". Rather, we'd like to allow LZ4 + DICT, but
> only compress the DICT blocks when the fallback code is activated. This
> implies that we need two new bits of code:
>
> 1) after encoding a block, we need to call something like
> block_encoder_->ShouldCompress(compression_codec). The encoder can then
> decide whether said compression is redundant on a per-block basis.
> 2) when writing each block, we need to store whether or not it was
> compressed. This is where things get tricky, since the format doesn't
> currently provide for such a per-block 'is_compressed' flag.
>
> So, this brings us to the subject of this email: we need to somehow evolve
> the cfile format to add such capabilities.
>
> Currently, if compression is enabled in a cfile, then every block has an
> 8-byte prefix <uint32 uncompressed_len> <uint32 compressed_len>. We could
> use a special value like '0xffffffff 0xffffffff' to indicate a "skipped
> compression" block, but it feels a bit like a hack, and wastes 8 bytes.
> Another option would be to use a shorter code like a single 0xffffffff to
> save 4 bytes.
>
> Obviously these options would break backwards-compatibility of the cfile
> format: the old reader would see these values and be confused, generating
> some kind of error about a corrupt cfile. That's not ideal, but I'm not
> sure I see a better way around it.
>
> Another thing we could do to generate a slightly less weird error would be
> to bump the CFile version number to indicate an incompatible change, and
> then use a simple scheme like a single byte "is_compressed" at the start of
> each block. However, it turns out that our reader code doesn't currently
> check the major/minor versions in the CFile header (whoops!), so we'd
> actually need to bump the 'magic' string at the start of the file.
>
> This, too, would be backwards-incompatible and prevent downgrade from a new
> version back to an old one.
>
> Any thoughts on the best way to address this issue?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
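For readers skimming the thread, the "single byte is_compressed at the start of each block" scheme described above could look roughly like the sketch below. The helper names and exact byte layout are assumptions for illustration, not Kudu's real on-disk format: a flag byte comes first, and the existing <uncompressed_len, compressed_len> pair is written only when the block was actually compressed, so uncompressed blocks pay one byte instead of eight.

```cpp
#include <cstdint>
#include <string>

// Illustrative sketch of a per-block is_compressed flag. Layout:
//   [1 byte flag][uint32 uncompressed_len][uint32 compressed_len][payload]
// when compressed, or just [1 byte flag][payload] when not.
namespace sketch {

// Append a uint32 in little-endian order (assumed wire order here).
void PutU32(std::string* dst, uint32_t v) {
  for (int i = 0; i < 4; ++i) {
    dst->push_back(static_cast<char>((v >> (8 * i)) & 0xff));
  }
}

uint32_t GetU32(const std::string& src, size_t off) {
  uint32_t v = 0;
  for (int i = 0; i < 4; ++i) {
    v |= static_cast<uint32_t>(static_cast<unsigned char>(src[off + i]))
         << (8 * i);
  }
  return v;
}

// Writer side: prepend the flag, and the two length fields only when the
// block was compressed.
std::string EncodeBlock(const std::string& data, bool is_compressed,
                        uint32_t uncompressed_len) {
  std::string out;
  out.push_back(is_compressed ? 1 : 0);
  if (is_compressed) {
    PutU32(&out, uncompressed_len);
    PutU32(&out, static_cast<uint32_t>(data.size()));
  }
  out.append(data);
  return out;
}

// Reader side: recover the flag and the payload.
struct Block {
  bool is_compressed;
  uint32_t uncompressed_len;  // meaningful only when is_compressed
  std::string payload;
};

Block DecodeBlock(const std::string& buf) {
  Block b;
  b.is_compressed = buf[0] != 0;
  b.uncompressed_len = 0;
  size_t off = 1;
  if (b.is_compressed) {
    b.uncompressed_len = GetU32(buf, off);
    off += 8;  // skip both length fields
  }
  b.payload = buf.substr(off);
  return b;
}

}  // namespace sketch
```

Since the flag byte changes the meaning of the first bytes of every block, an old reader would misparse new files, which is exactly why the scheme needs the version (or magic) bump discussed in the thread.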
