Naively I'd vote for the approach involving a change to the magic
string and (of course) adding version checking into the reader code,
to prevent an additional magic string change from being necessary in
the future. I think that's the most predictable approach (i.e. someone
less familiar with the code would easily understand the motivation).

On Wed, Nov 30, 2016 at 10:31 AM, Todd Lipcon <[email protected]> wrote:
> Hey folks,
>
> I'm working on KUDU-1600, which has the goal of ignoring redundant
> compression options on encodings that are already inherently compressed.
> For example, when using BITSHUFFLE encoding, lz4 compression is already
> done by the bitshuffle library, and a second pass of LZ4 or SNAPPY
> compression is just a waste of CPU.
>
> The most straightforward solution here would be to just disable it at table
> creation (or alter) time. However, the case of DICT_ENCODING is a bit more
> interesting: DICT_ENCODING operates per-block in two different modes: in
> the first mode, it dictionary-encodes the strings and uses bitshuffle to
> encode the code integers. In that case, lz4 on top is redundant. In the
> second mode, if the dictionary ends up being too-high cardinality, it falls
> back to PLAIN blocks, in which case lz4 is likely a good idea. This
> "encoding switching" is done inside the DICT_ENCODING block format: i.e. the
> cfile code always goes to the dict block decoder, which internally falls
> back to using the PLAIN decoder if necessary.
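The per-block mode switch described above could be sketched roughly as follows. Everything here is illustrative (the names `BlockMode`, `ChooseMode`, and the cardinality threshold are not Kudu's actual API); the point is just that the dict-vs-PLAIN decision happens per block, based on how large the dictionary grows:

```cpp
#include <set>
#include <string>
#include <vector>

// Hypothetical sketch: each block either stays dictionary-encoded
// (codes bitshuffled, which already applies LZ4 internally) or falls
// back to PLAIN encoding when cardinality gets too high.
enum class BlockMode { DICT_BITSHUFFLE, PLAIN_FALLBACK };

BlockMode ChooseMode(const std::vector<std::string>& values,
                     size_t max_dict_entries) {
  std::set<std::string> dict(values.begin(), values.end());
  // Low cardinality: dict codes + bitshuffle, so extra LZ4 is redundant.
  // High cardinality: PLAIN fallback, where LZ4 on top is worthwhile.
  return dict.size() <= max_dict_entries ? BlockMode::DICT_BITSHUFFLE
                                         : BlockMode::PLAIN_FALLBACK;
}
```

Because the decision is made block by block, a single cfile can mix both modes, which is exactly why a file-level "compress or don't" switch can't capture it.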
>
> Note that, within a single cfile, it's possible that some blocks have done
> "fallback" whereas others are dict-encoded.
>
> Given the above, it's not a good idea to have a blanket rule "LZ4 +
> DICT_ENCODING is disallowed". Rather, we'd like to allow LZ4 + DICT, but
> only compress the DICT blocks when the fallback code is activated. This
> implies that we need two new bits of code:
>
> 1) after encoding a block, we need to call something like
> block_encoder_->ShouldCompress(compression_codec). The encoder can then
> decide whether said compression is redundant on a per-block basis.
> 2) when writing each block, then, we need to store whether or not it was
> compressed. This is where things get tricky, since the format doesn't
> currently provide for such a per-block "is_compressed" flag.
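The two pieces above might look something like this sketch. The hook name `ShouldCompress`, the `BlockEncoder` struct, and the one-byte flag layout are all assumptions for illustration, not the real interfaces:

```cpp
#include <cstdint>
#include <vector>

enum class CompressionType { NO_COMPRESSION, LZ4, SNAPPY };

// Hypothetical encoder state: did this block fall back to PLAIN,
// or stay dict-encoded (bitshuffled, already LZ4'd internally)?
struct BlockEncoder { bool fell_back_to_plain; };

// (1) Per-block hook: the encoder decides whether a second
// compression pass would be redundant for the block it just built.
bool ShouldCompress(const BlockEncoder& enc, CompressionType codec) {
  if (codec == CompressionType::NO_COMPRESSION) return false;
  return enc.fell_back_to_plain;
}

// (2) The writer records that decision per block so the reader knows
// whether to decompress; shown here as a single leading flag byte.
void WriteBlock(std::vector<uint8_t>* out, bool is_compressed,
                const std::vector<uint8_t>& payload) {
  out->push_back(is_compressed ? 1 : 0);
  out->insert(out->end(), payload.begin(), payload.end());
}
```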
>
> So, this brings us to the subject of this email: we need to somehow evolve
> the cfile format to add such capabilities.
>
> Currently, if compression is enabled in a cfile, then every block has an
> 8-byte prefix <uint32 uncompressed_len> <uint32 compressed_len>. We could
> use a special value like '0xffffffff 0xffffffff' to indicate a "skipped
> compression" block, but it feels a bit like a hack, and wastes 8 bytes.
> Another option would be to use a shorter code like a single 0xffffffff to
> save 4 bytes.
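The shorter-sentinel variant could be sketched like this (the exact byte layout, helper names, and sentinel handling are illustrative assumptions, not the real cfile format):

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Sketch of the sentinel idea: a single 0xffffffff where the
// compression prefix would start marks a "compression skipped"
// block, costing 4 bytes instead of the usual 8-byte
// <uncompressed_len><compressed_len> prefix. Note the hack Todd
// flags: a genuine length of 0xffffffff would be ambiguous.
constexpr uint32_t kSkipSentinel = 0xffffffffu;

struct BlockPrefix {
  bool compressed;
  uint32_t uncompressed_len;
  uint32_t compressed_len;
};

std::vector<uint8_t> MakeSkippedPrefix() {
  std::vector<uint8_t> out(4);
  std::memcpy(out.data(), &kSkipSentinel, 4);
  return out;
}

std::vector<uint8_t> MakeCompressedPrefix(uint32_t uncompressed_len,
                                          uint32_t compressed_len) {
  std::vector<uint8_t> out(8);
  std::memcpy(out.data(), &uncompressed_len, 4);
  std::memcpy(out.data() + 4, &compressed_len, 4);
  return out;
}

BlockPrefix ParsePrefix(const std::vector<uint8_t>& data) {
  uint32_t first;
  std::memcpy(&first, data.data(), 4);
  if (first == kSkipSentinel) return {false, 0, 0};
  uint32_t second;
  std::memcpy(&second, data.data() + 4, 4);
  return {true, first, second};
}
```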
>
> Obviously these options would break backwards-compatibility of the cfile
> format: the old reader would see these values and be confused, generating
> some kind of error about a corrupt cfile. That's not ideal, but I'm not
> sure I see a better way around it.
> see a better way around it.
>
> Another thing we could do to generate a slightly less weird error would be
> to bump the CFile version number to indicate an incompatible change, and
> then use a simple scheme like a single byte "is_compressed" at the start of
> each block. However, it turns out that our reader code doesn't currently
> check the major/minor versions in the CFile header (woops!) so we'd
> actually need to bump the 'magic' string at the start of the file.
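A fixed reader under this scheme might gate on the header before trusting the new flag byte, along these lines. The magic string, struct fields, and the "version 2" cutoff below are all placeholders, not the real header values:

```cpp
#include <string>

// Illustrative header check. As noted above, current readers skip
// this validation entirely, which is why bumping only the version
// number wouldn't stop an old reader: only changing the magic
// string itself makes old readers reject the new files.
struct CFileHeader {
  std::string magic;
  int major_version;
};

bool HasPerBlockFlags(const CFileHeader& h) {
  // Per-block "is_compressed" flags are (hypothetically) introduced
  // in major version 2 of the format.
  return h.magic == "kuducfil" && h.major_version >= 2;
}
```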
>
> This, too, would be backwards-incompatible and prevent downgrade from a new
> version back to an old one.
>
> Any thoughts on the best way to address this issue?
>
> -Todd
> --
> Todd Lipcon
> Software Engineer, Cloudera
