Hello,
This is a proposal to deprecate and/or remove LZ4 compression from the
Parquet specification.

Abstract
--------

Despite several attempts by the parquet-cpp developers, we were not able
to reach the point where LZ4-compressed Parquet files are bidirectionally
compatible between parquet-cpp and parquet-mr. Other implementations are
having, or have had, similar issues. My conclusion is that the Parquet
spec doesn't allow independent reimplementation of the LZ4 compression
format required by parquet-mr. Therefore, LZ4 compression should be
removed from the spec (possibly replaced with another enum value for a
properly-specified, interoperable, LZ4-backed compression scheme).

About LZ4
---------

LZ4 is an extremely fast compression format and library. Decompression
speeds of 4 GB/s can routinely be achieved in pure software, with
compression speeds around 800 MB/s. Compression ratios are close to
those achieved by Snappy and LZO, but at vastly higher speeds.

Two formats are officially defined by the LZ4 project: the block format
and the frame format. The frame format, as the name suggests, is
higher-level and contains all the information required to decompress
arbitrary data buffers. The block format is to be used when such
information is made available separately (for example as part of
ancillary metadata, protocol headers, etc.).

Core issue
----------

LZ4 compression in Parquet (or, more accurately, in parquet-mr, which
seems to be the point of reference to look to when implementing the
Parquet format) uses a home-baked framing around LZ4 block compression
that's not specified anywhere. The only way to get information about it
is to read the Hadoop source code. Being unspecified, it also doesn't
say whether there are additional constraints on the parameters (e.g.
frame size). Such constraints will be implementation-defined,
undocumented, and only discoverable through tedious testing and
iteration, with no guarantee of ever achieving 100% compatibility.
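For concreteness, here is a rough sketch of what this home-baked framing
appears to look like, as inferred from the Hadoop source code rather
than from any specification. zlib stands in for the LZ4 block codec
(the framing itself is codec-agnostic, and zlib ships with Python), and
the chunk size is an assumption based on observed parquet-mr behavior:

```python
import struct
import zlib

# Hypothetical sketch of the home-baked framing that parquet-mr appears
# to wrap around LZ4 block compression, inferred from the Hadoop source
# code; there is no specification to check this against. zlib stands in
# for the LZ4 block codec, since the framing itself is codec-agnostic.

CHUNK = 128 * 1024  # assumed frame boundary (parquet-mr seems to use ~128 kiB)

def frame_compress(data: bytes) -> bytes:
    """Compress `data` into one or more Hadoop-style frames."""
    out = bytearray()
    for i in range(0, len(data), CHUNK):
        chunk = data[i:i + CHUNK]
        compressed = zlib.compress(chunk)
        # Each frame: uncompressed size (uint32, big-endian), then
        # compressed size (uint32, big-endian), then the compressed bytes.
        out += struct.pack(">II", len(chunk), len(compressed))
        out += compressed
    return bytes(out)

def frame_decompress(buf: bytes) -> bytes:
    """Decompress a buffer that may hold several consecutive frames."""
    out = bytearray()
    pos = 0
    # A single data page may contain several frames, so keep reading
    # until the buffer is exhausted.
    while pos < len(buf):
        raw_len, comp_len = struct.unpack_from(">II", buf, pos)
        pos += 8
        chunk = zlib.decompress(buf[pos:pos + comp_len])
        assert len(chunk) == raw_len, "frame header disagrees with payload"
        out += chunk
        pos += comp_len
    return bytes(out)
```

Note that a reader must be prepared to find several such frames inside a
single data page; nothing in the Parquet spec tells an implementer that.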
Note that LZ4 compression itself officially comes in two formats: a
low-level block format and a higher-level frame format. But parquet-mr
uses a third, custom framing format that's not part of the LZ4 format.

History of compatibility issues
-------------------------------

1) parquet-cpp originally (naively?) assumed that "LZ4 compression" in
the Parquet spec meant the LZ4 block format. After all, information
about data size is already available in the Parquet metadata, so no
specific framing is theoretically required around the compressed data.

However, it quickly became apparent that this interpretation was
incompatible with files produced by parquet-mr (and vice-versa: files
produced by parquet-cpp could not be read with parquet-mr).

A first issue was posted to suggest switching the Parquet spec (and
parquet-mr) to the LZ4 frame format:
https://issues.apache.org/jira/browse/PARQUET-1241

However, this didn't come to a resolution, because people didn't want to
break compatibility with previous versions of parquet-mr (note that this
concern necessarily shifts the burden of compatibility breakage onto
other implementations).

Relatedly, there is an issue open for Hadoop, which also didn't get a
resolution:
https://issues.apache.org/jira/browse/HADOOP-12990

2) To avoid making things worse, parquet-cpp developers then decided to
(temporarily?) disable writing LZ4 files from C++:
https://issues.apache.org/jira/browse/ARROW-9424
(note that this is parquet-cpp deliberately crippling its own feature
set in order to work around an issue created by an undocumented format
implemented in parquet-mr)

At this point, though, the LZ4 *reader* in parquet-cpp could still not
read files produced by parquet-mr. So, besides being frustrating for C++
users (and users of bindings to the C++ library, e.g. Python, Ruby...),
this decision also did not make interoperability better in the
Java -> C++ direction.
3) parquet-cpp developers decided to implement format detection so as to
read LZ4 files produced by parquet-mr, but also LZ4 files produced by
previous versions of parquet-cpp:
https://issues.apache.org/jira/browse/PARQUET-1878

In addition, the write path was reenabled, this time producing files
that (should!) conform to the parquet-mr implementation of LZ4
compression. This seemed to work fine.

4) While PARQUET-1878 (see above) seemed to work fine in basic tests, it
was later reported that some files did not read properly. Indeed, our
(parquet-cpp) compatibility code assumed that a single "parquet-mr LZ4
frame" was ever written out for a single data page. However, it turns
out that parquet-mr can produce several such frames for larger data
pages (it seems to frame at around 128 kiB boundaries). While this of
course seems reasonable, the format being undocumented prevented us from
foreseeing this situation.

Once the issue was diagnosed, we (parquet-cpp developers) pushed a fix
for it:
https://issues.apache.org/jira/browse/ARROW-11301

5) We were happy after this latter fix... only for a short time. Then
the reverse problem appeared: some LZ4 files produced by parquet-cpp
cannot be read by parquet-mr:
https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168

I decided that this couldn't be a parquet-cpp issue anymore. Evidently,
the fact that parquet-mr uses an unspecified framing format with
unspecified constraints and limits prevents other implementations from
being compatible. So I asked the reporter to open a parquet-mr issue
instead, and then took the liberty of marking the issue as a blocker:
https://issues.apache.org/jira/browse/PARQUET-1974

6) Unsurprisingly, other Parquet implementations are also suffering from
this.
* fastparquet (a Parquet implementation written partly in Python) had
initially used the LZ4 block format, then switched to the LZ4 frame
format to achieve compatibility with parquet-cpp (since both are used in
the Python ecosystem):
https://github.com/dask/fastparquet/issues/314
Of course, fastparquet probably won't be able to read or write files
compatible with parquet-mr...

* The Rust Parquet implementation that's part of the Rust Arrow
implementation is also having compatibility issues due to making the
wrong guess:
https://issues.apache.org/jira/browse/ARROW-11381

* Impala seems to have had to change its LZ4 implementation to also
match the undocumented parquet-mr framing format:
https://issues.apache.org/jira/browse/IMPALA-8617
(I have no idea whether this solved all LZ4 problems for Impala, or
whether it broke compatibility with previous Impala versions)

Possible solutions
------------------

1) Do nothing. The Parquet ecosystem remains forever fragmented, even
though the official Parquet documentation doesn't say so.

2) parquet-cpp finally achieves compatibility through additional
reverse-engineering efforts. While I have no authority to prevent this,
I also have zero personal motivation to achieve it anymore. I suspect
other parquet-cpp developers are not significantly more motivated than I
am (but I may be wrong).

3) parquet-mr takes its share of the effort by striving to be compatible
with files produced by (the current iteration of) parquet-cpp. I have no
idea whether this is easily doable, or whether there is any motivation
or workforce to do it. Therefore, I will conservatively rate this as
"unlikely in the short term".

4) Admitting that LZ4 compatibility is not currently an achievable goal,
the LZ4 compression option is deprecated and removed from the Parquet
spec. Implementations display a warning message when an LZ4 file is
being read. Writing LZ4 files is disabled.
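To make concrete what the reverse-engineering in solution #2 amounts to,
here is a rough sketch of the kind of format detection adopted in
PARQUET-1878. This is a hypothetical illustration, not the actual
parquet-cpp code; zlib stands in for the LZ4 block codec, and the check
is exactly what it looks like: a guess that can misfire.

```python
import struct
import zlib

# Hypothetical sketch of format detection in the spirit of PARQUET-1878:
# peek at the first 8 bytes of a compressed buffer and decide whether
# they are plausible as a Hadoop-style frame header. An illustration
# only; zlib stands in for the LZ4 block codec.

def is_hadoop_framed(buf: bytes, expected_raw_len: int) -> bool:
    if len(buf) < 8:
        return False
    raw_len, comp_len = struct.unpack_from(">II", buf, 0)
    # A plausible frame header declares sizes consistent with the page:
    # the uncompressed size cannot exceed what the Parquet metadata
    # promises, and the compressed payload must fit inside the buffer.
    # This is a heuristic and can misfire, which is part of the problem.
    return raw_len <= expected_raw_len and 8 + comp_len <= len(buf)

def decompress_page(buf: bytes, expected_raw_len: int) -> bytes:
    if is_hadoop_framed(buf, expected_raw_len):
        # Hadoop-style framing: iterate over consecutive
        # (uncompressed size, compressed size, payload) frames.
        out = bytearray()
        pos = 0
        while pos < len(buf):
            raw_len, comp_len = struct.unpack_from(">II", buf, pos)
            pos += 8
            out += zlib.decompress(buf[pos:pos + comp_len])
            pos += comp_len
        return bytes(out)
    # Otherwise treat the buffer as a single raw compressed block.
    return zlib.decompress(buf)
```

A well-specified format would make such guessing unnecessary in the
first place.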
Discussion
----------

Solution #1 seems evidently undesirable to me (though it also seems that
nobody on the Java side was tremendously impatient to solve this issue).
As I said, solution #2 is rather unlikely.

Java developers would have to evaluate how likely solution #3 is. We
should avoid depending on promises that nobody is really willing to
keep, though. Please only say "we're going to solve this" if you're
really willing to push it through.

In any case, even if solution #2 or #3 were to be implemented, it would
not necessarily help future implementations, as long as the format is
still unspecified.

In the end, while a bit frustrating for everyone, solution #4 looks like
the only reasonable one for the time being.

Future LZ4 format
-----------------

While we should deprecate the current LZ4 format, this doesn't preclude
adding another, this time well-specified, LZ4 format in the future. It
would simply have to use a new `CompressionCodec` enum value in the
Thrift definition. Most certainly, it should be either of the two
formats officially defined by the LZ4 project (the block format or the
frame format).

Regards

Antoine.