Hello,

This is a proposal to deprecate and/or remove LZ4 compression from the
Parquet specification.

Abstract
--------

Despite several attempts by the parquet-cpp developers, we were not
able to reach the point where LZ4-compressed Parquet files are
bidirectionally compatible between parquet-cpp and parquet-mr. Other
implementations are having, or have had, similar issues.  My conclusion
is that the Parquet spec doesn't provide enough information to allow an
independent reimplementation of the LZ4 compression format required by
parquet-mr.  Therefore, LZ4
compression should be removed from the spec (possibly replaced with
another enum value for a properly-specified, interoperable, LZ4-backed
compression scheme).

About LZ4
---------

LZ4 is an extremely fast compression format and library.  Decompression
speeds of 4 GB/s can routinely be achieved in pure software, with
compression speeds around 800 MB/s. Compression ratios are close to
those achieved by Snappy and LZO, but at vastly higher speeds.

Two formats are officially defined by the LZ4 project: the block format
and the frame format.  The frame format, as the name suggests, is
higher-level and contains all the information required to decompress
arbitrary data buffers.  The block format is to be used when such
information is made available separately (for example as part of
ancillary metadata, protocol headers, etc.).
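
As a quick illustration of the difference, here is a sketch using the
python-lz4 package (not Parquet-specific code):

    import lz4.block
    import lz4.frame

    data = b"some page data" * 1000

    # Frame format: self-describing, the blob alone suffices.
    framed = lz4.frame.compress(data)
    assert lz4.frame.decompress(framed) == data

    # Block format: the uncompressed size must be tracked separately.
    block = lz4.block.compress(data, store_size=False)
    assert lz4.block.decompress(block, uncompressed_size=len(data)) == data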

Core issue
----------

LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
seems to serve as the de facto reference when implementing the Parquet
format) uses a home-baked framing around LZ4 block compression
that's not specified anywhere. The only way to get information about it
is to read the Hadoop source code.

Being unspecified, the format also leaves open whether there are
additional constraints on its parameters (e.g. frame size).  Such
constraints end up implementation-defined, undocumented, and only
discoverable through tedious testing and iteration, with no guarantee
of ever achieving 100% compatibility.

Note that LZ4 compression itself officially comes in two formats: a
low-level block format and a higher-level frame format.  But parquet-mr
uses a third, custom framing format that's not part of the LZ4
project's specifications.
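
For the record, here is what that framing appears to look like, as
reverse-engineered from the Hadoop source code (a Python sketch; treat
the layout as an educated guess, since nothing normative documents it):

    import struct
    import lz4.block

    def decode_one_hadoop_frame(buf):
        # Assumed layout of one parquet-mr/Hadoop "frame":
        #   4 bytes: big-endian uncompressed size
        #   4 bytes: big-endian compressed size
        #   N bytes: raw LZ4 block data
        raw_size, comp_size = struct.unpack_from(">II", buf, 0)
        data = lz4.block.decompress(buf[8:8 + comp_size],
                                    uncompressed_size=raw_size)
        return data, 8 + comp_size  # decompressed bytes, bytes consumed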

History of compatibility issues
-------------------------------

1)

parquet-cpp originally (naively?) assumed that "LZ4 compression" in the
Parquet spec meant the LZ4 block format.  After all, information
about data size is already available in the Parquet metadata, so no
specific framing is theoretically required around the compressed data.
However, it quickly became apparent that this interpretation was
incompatible with files produced by parquet-mr (and vice versa: files
produced by parquet-cpp could not be read by parquet-mr).
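
Under that interpretation, reading a page needs nothing beyond the
sizes already recorded in the page header (a sketch;
`uncompressed_page_size` stands for the value from the Parquet page
header):

    import lz4.block

    def read_lz4_block_page(compressed, uncompressed_page_size):
        # The page header already records the uncompressed size,
        # so a raw LZ4 block needs no additional framing.
        return lz4.block.decompress(
            compressed, uncompressed_size=uncompressed_page_size)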

A first issue was posted to suggest switching the Parquet spec (and
parquet-mr) to the LZ4 framed format:
https://issues.apache.org/jira/browse/PARQUET-1241

However, this didn't come to a resolution, because people didn't want
to break compatibility with previous versions of parquet-mr (note that
this concern necessarily shifts the burden of compatibility breakage
onto other implementations).

Relatedly, there is an open Hadoop issue, which also never reached a
resolution:
https://issues.apache.org/jira/browse/HADOOP-12990

2)

To avoid making things worse, parquet-cpp developers then decided to
(temporarily?) disable writing LZ4 files from C++:
https://issues.apache.org/jira/browse/ARROW-9424

(note that this is parquet-cpp deliberately crippling its own feature
set in order to work around an issue created by an undocumented format
implemented in parquet-mr)

At this point, though, the LZ4 *reader* in parquet-cpp still could not
read files produced by parquet-mr.  So, besides being frustrating for
C++ users (and users of bindings to the C++ library, e.g. Python,
Ruby...), this decision also did not improve interoperability in the
Java -> C++ direction.

3)

parquet-cpp developers decided to implement format detection so as to
read LZ4 files produced by parquet-mr, as well as LZ4 files produced by
previous versions of parquet-cpp:
https://issues.apache.org/jira/browse/PARQUET-1878
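
The detection boils down to a heuristic along these lines (a
simplified sketch, not the actual parquet-cpp code):

    import struct

    def looks_like_hadoop_framing(buf, uncompressed_page_size):
        # If the first 8 bytes form plausible Hadoop-style size
        # prefixes, assume parquet-mr framing; otherwise fall back
        # to interpreting the buffer as a raw LZ4 block.
        if len(buf) < 8:
            return False
        raw_size, comp_size = struct.unpack_from(">II", buf, 0)
        return (raw_size <= uncompressed_page_size
                and comp_size <= len(buf) - 8)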

In addition, the write path was re-enabled, this time producing files
that (should!) conform to the parquet-mr implementation of LZ4
compression.

This seemed to work fine.

4)

While PARQUET-1878 (see above) seemed to work fine in basic tests, it
was later reported that some files did not read properly.

Indeed, our (parquet-cpp) compatibility code assumed that only a single
"parquet-mr LZ4 frame" was ever written out for a given data page.
However, it turns out that parquet-mr can produce several such frames
for larger data pages (it seems to frame at around 128 kiB boundaries).

While this is of course reasonable, the format being undocumented
prevented us from foreseeing this situation.  Once the issue was
diagnosed, we (parquet-cpp developers) pushed a fix for it:
https://issues.apache.org/jira/browse/ARROW-11301
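
In terms of the framing sketch given earlier, the fix amounts to
looping over frames until the page buffer is exhausted, instead of
decoding a single one:

    def decode_hadoop_lz4_page(buf):
        # Reuses decode_one_hadoop_frame() from the sketch above.
        out = bytearray()
        pos = 0
        while pos < len(buf):
            data, consumed = decode_one_hadoop_frame(buf[pos:])
            out += data
            pos += consumed
        return bytes(out)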

5)

We were happy after this latest fix... but only for a short time.  Now
the reverse problem appeared: some LZ4 files produced by parquet-cpp
cannot be read by parquet-mr:
https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168

I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
the fact that parquet-mr uses an unspecified framing format with
unspecified constraints and limits prevents other implementations from
being compatible.

So I asked the reporter to open a parquet-mr issue instead, and then
took the liberty of marking the issue as blocker:
https://issues.apache.org/jira/browse/PARQUET-1974

6)

Unsurprisingly, other Parquet implementations are also suffering from
this.

* fastparquet (a Parquet implementation written partly in Python) had
  initially used the LZ4 block format, then switched to the LZ4 frame
  format to achieve compatibility with parquet-cpp (since both are used
  in the Python ecosystem):
https://github.com/dask/fastparquet/issues/314

Of course, fastparquet probably won't be able to read or write files
compatible with parquet-mr...

* The Rust Parquet implementation that's part of the Rust Arrow
  codebase is also having compatibility issues due to making the
  wrong guess:
https://issues.apache.org/jira/browse/ARROW-11381

* Impala seems to have had to change its LZ4 implementation to also
  match the undocumented parquet-mr framing format:
https://issues.apache.org/jira/browse/IMPALA-8617

(I have no idea whether this solved all LZ4 problems for Impala, and
whether it broke compatibility with previous Impala versions)

Possible solutions
------------------

1) Do nothing.  The Parquet ecosystem is forever fragmented, even
though the official Parquet documentation doesn't say so.

2) parquet-cpp finally achieves compatibility through additional
reverse-engineering efforts.  While I have no authority to prevent
this, I also have zero personal motivation to achieve it anymore.  I
suspect other parquet-cpp developers are not significantly more
motivated than I am (but I may be wrong).

3) parquet-mr takes its share of the effort by striving to be
compatible with files produced by (the current iteration of)
parquet-cpp. I have no idea whether this is easily doable, and whether
there is any motivation or workforce to do it. Therefore, I will
conservatively rate this as "unlikely in the short term".

4) Admitting that LZ4 compatibility is currently not an achievable
goal, the LZ4 compression option is deprecated and removed from the
Parquet spec.  Implementations display a warning message when an LZ4
file is read.  Writing LZ4 files is disabled.

Discussion
----------

Solution #1 seems evidently undesirable to me (though it also seems
that nobody on the Java side was tremendously impatient to solve this
issue).

As I said, solution #2 is rather unlikely.

Java developers would have to evaluate how likely solution #3 is.  We
should avoid depending on promises that nobody is really willing to
keep, though.  Please only say "we're going to solve this" if you're
really willing to push it through.

In any case, even if solution #2 or #3 were to be implemented, it would
not necessarily help future implementations, as long as the format is
still unspecified.

In the end, while a bit frustrating for everyone, solution #4 looks like
the only reasonable one for the time being.

Future LZ4 format
-----------------

While we should deprecate the current LZ4 format, this doesn't preclude
adding another, this time well-specified, LZ4 format in the future.  It
would simply have to use a new `CompressionCodec` enum value in the
Thrift definition.

It should almost certainly be either of the two formats officially
defined by the LZ4 project (the block format or the frame format).


Regards

Antoine.

