When we are talking about deprecating an already released feature (this time LZ4), we shall always support the read path for the existing files. I think this is a fundamental requirement for every file format. The deprecation is more about the write path, which has different levels: (1) we don't recommend that users use such features; (2) we disable the related feature by default, with an option for the user to enable it; or (3) we disable the feature entirely, so the user is not able to use it anymore.
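
To illustrate the 2nd level, I could imagine a guard like the following in the parquet-mr write path. This is only a sketch: the property name and the method are made up, nothing like this exists today.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.parquet.hadoop.metadata.CompressionCodecName;

    class CodecWriteGuard {
      // Hypothetical property name -- for illustration only.
      static final String ENABLE_LZ4 = "parquet.compression.lz4.enabled";

      // Called before a writer is built: LZ4 writes are rejected unless the
      // user explicitly opted in; the read path is left untouched.
      static void validateForWrite(CompressionCodecName codec, Configuration conf) {
        if (codec == CompressionCodecName.LZ4 && !conf.getBoolean(ENABLE_LZ4, false)) {
          throw new IllegalArgumentException(
              "LZ4 is deprecated: its framing is not interoperable across Parquet"
                  + " implementations. Set " + ENABLE_LZ4 + "=true to write it anyway.");
        }
      }
    }
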
For LZ4 we might choose the 2nd option, so the user can still use it if they understand the risks (e.g. using only Java-based components, in which case it should work fine). After we have the new, properly specified LZ4 available in parquet-format and in the implementations, we should completely disable (3rd option) the write path of the original LZ4.

I agree that we need more comprehensive integration testing. (I have created a jira about my ideas: PARQUET-1985 <https://issues.apache.org/jira/browse/PARQUET-1985>.) Meanwhile, it also turned out that for features like compression codecs the specification is the key. We might always have specific data that we did not include in the integration tests and that still fails at runtime.

On Wed, Feb 17, 2021 at 4:59 AM Micah Kornfield <emkornfi...@gmail.com> wrote:

> > I think it would be a mistake for someone who has written Hadoop-Lz4 for several years with parquet-mr to all of a sudden be no longer able to read their files. (I believe that parquet-mr with this pattern has been incorporated into various libraries for several years now--correct me if I'm wrong.)
>
> I think this aligns with option #4 proposed above?
>
> It seems, simply by usage, that #2 (more reverse engineering on the C++ side) or #4 would be the way to go. I'll admit I am also not motivated to reverse engineer this at the moment. I think on the C++ side at least we should disable writing to avoid further fragmentation. It also isn't clear to me whether we have solid data to suggest the other compression formats are not also broken.
>
> I think this has come up before, but more comprehensive integration testing is key here (there was another recent bug [1] where some of the pyarrow/C++-written files aren't readable by some Java libraries, so I think there are potentially more important issues we need to deal with).
>
> [1] https://issues.apache.org/jira/browse/ARROW-11629
>
> On Tue, Feb 16, 2021 at 10:09 AM Jacques Nadeau <jacq...@apache.org> wrote:
>
> > There is some ambiguity in the discussion and proposals here around deprecating future writing versus supporting reading of already written data, and what it means to deprecate something in the format specification.
> >
> > I think it would be a mistake for someone who has written Hadoop-Lz4 for several years with parquet-mr to all of a sudden be no longer able to read their files. (I believe that parquet-mr with this pattern has been incorporated into various libraries for several years now--correct me if I'm wrong.)
> >
> > On Tue, Feb 16, 2021 at 8:26 AM Gabor Szadovszky <ga...@apache.org> wrote:
> >
> > > Thank you for the detailed summary of the LZ4 situation, Antoine!
> > >
> > > The Parquet file format should be properly specified for every implementation. It was the mistake of the parquet-mr developers that we thought the Hadoop implementation of LZ4 conformed to the LZ4 specification, and the fault of the Parquet community that we extended parquet-format with the LZ4 option without checking that there are 2 options to choose from (not talking about the 3rd, Hadoop one).
> > >
> > > I agree that option 4 is the only way we can step forward from this situation. We shall deprecate LZ4 in parquet-format, and in the meantime we should agree on which officially specified LZ4 format we want to introduce.
> > > Of course, we may try to improve compatibility with the existing LZ4 files, but we should not encourage our users to use this compression for now.
> > >
> > > Because we already have parquet-mr releases that use the under-specified Hadoop LZ4 codec, I don't feel it is that urgent to block the current parquet-mr release because of this. We shall update parquet-format to make the LZ4 situation clear, then create a release. Then we can start working on deprecating/blocking the write path of the current implementation and implementing the properly specified LZ4 support in all the implementations.
> > >
> > > Regards,
> > > Gabor
> > >
> > > On Tue, Feb 16, 2021 at 3:15 PM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > This is a proposal to deprecate and/or remove LZ4 compression from the Parquet specification.
> > > >
> > > > Abstract
> > > > --------
> > > >
> > > > Despite several attempts by the parquet-cpp developers, we were not able to reach the point where LZ4-compressed Parquet files are bidirectionally compatible between parquet-cpp and parquet-mr. Other implementations are having, or have had, similar issues. My conclusion is that the Parquet spec doesn't allow independent reimplementation of the LZ4 compression format required by parquet-mr. Therefore, LZ4 compression should be removed from the spec (possibly replaced with another enum value for a properly-specified, interoperable, LZ4-backed compression scheme).
> > > >
> > > > About LZ4
> > > > ---------
> > > >
> > > > LZ4 is an extremely fast compression format and library. Decompression speeds of 4 GB/s can routinely be achieved in pure software, with compression speeds around 800 MB/s. Compression ratios are close to those achieved by Snappy and LZO, but at vastly higher speeds.
> > > >
> > > > Two formats are officially defined by the LZ4 project: the block format and the frame format. The frame format, as the name suggests, is higher-level and contains all the information required to decompress arbitrary data buffers. The block format is to be used when such information is made available separately (for example as part of ancillary metadata, protocol headers, etc.).
> > > >
> > > > Core issue
> > > > ----------
> > > >
> > > > LZ4 compression in Parquet (or, more accurately, in parquet-mr, which seems to be the point of reference to look to when implementing the Parquet format) uses a home-baked framing around LZ4 block compression that's not specified anywhere. The only way to get information about it is to read the Hadoop source code.
> > > >
> > > > Being unspecified, it also doesn't say whether there are additional constraints on the parameters (e.g. frame size). Such constraints will be implementation-defined, undocumented, and only discoverable through tedious testing and iteration, with no guarantee of ever achieving 100% compatibility.
> > > >
> > > > Note that LZ4 compression itself officially comes in two formats: a low-level block format and a higher-level framed format. But parquet-mr uses a third, custom framing format that's not part of the LZ4 format.
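> > > >
> > > > To make this concrete, here is a rough Java sketch of a decoder for that framing, based purely on reading the Hadoop source. Everything in it -- the header layout, the ~128 kiB frame size, the assumption of one LZ4 block per frame -- is reverse-engineered rather than specified, which is exactly the problem:
> > > >
> > > >     import java.nio.ByteBuffer;
> > > >     import net.jpountz.lz4.LZ4Factory;
> > > >     import net.jpountz.lz4.LZ4SafeDecompressor;
> > > >
> > > >     class HadoopLz4Sketch {
> > > >       // Observed layout, repeated until the page buffer is exhausted
> > > >       // (parquet-mr appears to cut frames at ~128 kiB boundaries):
> > > >       //   4 bytes, big-endian: decompressed size of this frame
> > > >       //   4 bytes, big-endian: compressed size of this frame
> > > >       //   <compressedSize bytes: a raw LZ4 block>
> > > >       static byte[] decompress(byte[] page, int totalDecompressedSize) {
> > > >         LZ4SafeDecompressor lz4 = LZ4Factory.fastestInstance().safeDecompressor();
> > > >         byte[] out = new byte[totalDecompressedSize];
> > > >         ByteBuffer in = ByteBuffer.wrap(page);  // big-endian by default
> > > >         int outPos = 0;
> > > >         while (outPos < totalDecompressedSize) {
> > > >           int decompressedSize = in.getInt();
> > > >           int compressedSize = in.getInt();
> > > >           lz4.decompress(page, in.position(), compressedSize, out, outPos);
> > > >           in.position(in.position() + compressedSize);
> > > >           outPos += decompressedSize;
> > > >         }
> > > >         return out;
> > > >       }
> > > >     }
> > > >
> > > > Whether a frame may contain more than one inner LZ4 block, or what the maximum frame size is, are exactly the kinds of questions the missing specification leaves open.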
> > > >
> > > > History of compatibility issues
> > > > -------------------------------
> > > >
> > > > 1)
> > > >
> > > > parquet-cpp originally (naively?) assumed that "LZ4 compression" in the Parquet spec meant the LZ4 block format. After all, information about data size is already available in the Parquet metadata, so no specific framing is theoretically required around the compressed data. However, it quickly turned out that this interpretation was incompatible with files produced by parquet-mr (and vice-versa: files produced by parquet-cpp could not be read with parquet-mr).
> > > >
> > > > A first issue was posted to suggest switching the Parquet spec (and parquet-mr) to the LZ4 framed format: https://issues.apache.org/jira/browse/PARQUET-1241
> > > >
> > > > However, this didn't come to a resolution, because people didn't want to break compatibility with previous versions of parquet-mr (note that this concern necessarily shifts the burden of compatibility breakage onto other implementations).
> > > >
> > > > Relatedly, there is an issue open for Hadoop, which also didn't get a resolution: https://issues.apache.org/jira/browse/HADOOP-12990
> > > >
> > > > 2)
> > > >
> > > > To avoid making things worse, parquet-cpp developers then decided to (temporarily?) disable writing LZ4 files from C++: https://issues.apache.org/jira/browse/ARROW-9424
> > > >
> > > > (Note that this is parquet-cpp deliberately crippling its own feature set in order to work around an issue created by an undocumented format implemented in parquet-mr.)
> > > >
> > > > At this point, though, the LZ4 *reader* in parquet-cpp could still not read files produced by parquet-mr. So, besides being frustrating for C++ users (and users of bindings to the C++ library, e.g. Python, Ruby...), this decision also did not make interoperability better in the Java -> C++ direction.
> > > >
> > > > 3)
> > > >
> > > > parquet-cpp developers decided to implement format detection, so as to read LZ4 files produced by parquet-mr, but also LZ4 files produced by previous versions of parquet-cpp: https://issues.apache.org/jira/browse/PARQUET-1878
> > > >
> > > > In addition, the write path was re-enabled, this time producing files that (should!) conform to the parquet-mr implementation of LZ4 compression.
> > > >
> > > > This seemed to work fine.
> > > >
> > > > 4)
> > > >
> > > > While PARQUET-1878 (see above) seemed to work fine in basic tests, it was later reported that some files did not read properly.
> > > >
> > > > Indeed, our (parquet-cpp) compatibility code assumed that only a single "parquet-mr LZ4 frame" was ever written out for a single data page. However, it turns out that parquet-mr can produce several such frames for larger data pages (it seems to frame at around 128 kiB boundaries).
> > > >
> > > > While this of course seems reasonable, the format being undocumented prevented us from foreseeing this situation. Once the issue was diagnosed, we (parquet-cpp developers) pushed a fix for it: https://issues.apache.org/jira/browse/ARROW-11301
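> > > >
> > > > For the curious, the detection is essentially a probe of the first 8 bytes of the page. A simplified Java rendition of the parquet-cpp logic (the real code lives in the C++ codebase and handles more corner cases than this sketch):
> > > >
> > > >     import java.nio.ByteBuffer;
> > > >
> > > >     class Lz4FormatProbe {
> > > >       // Does this page start with a plausible Hadoop frame header, i.e. two
> > > >       // big-endian sizes that are consistent with the page we are holding?
> > > >       // If not, fall back to treating the page as one raw LZ4 block.
> > > >       static boolean looksLikeHadoopFraming(byte[] page, int expectedDecompressedSize) {
> > > >         if (page.length < 8) {
> > > >           return false;
> > > >         }
> > > >         ByteBuffer in = ByteBuffer.wrap(page);  // big-endian by default
> > > >         int decompressedSize = in.getInt();
> > > >         int compressedSize = in.getInt();
> > > >         return decompressedSize <= expectedDecompressedSize
> > > >             && 8 + compressedSize <= page.length;
> > > >       }
> > > >     }
> > > >
> > > > Note that a raw LZ4 block can, by coincidence, also start with 8 bytes that pass this check, which is why such guessing can never be made 100% reliable.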
> > > >
> > > > 5)
> > > >
> > > > We were happy after this latest fix... only for a short time. Now the reverse problem happened: some LZ4 files produced by parquet-cpp cannot be read by parquet-mr:
> > > > https://issues.apache.org/jira/browse/PARQUET-1878?focusedCommentId=17279168#comment-17279168
> > > >
> > > > I decided that this couldn't be a parquet-cpp issue anymore. Evidently, the fact that parquet-mr uses an unspecified framing format with unspecified constraints and limits prevents other implementations from being compatible.
> > > >
> > > > So I asked the reporter to open a parquet-mr issue instead, and then took the liberty of marking the issue as a blocker: https://issues.apache.org/jira/browse/PARQUET-1974
> > > >
> > > > 6)
> > > >
> > > > Unsurprisingly, other Parquet implementations are also suffering from this.
> > > >
> > > > * fastparquet (a Parquet implementation written partly in Python) had initially used the LZ4 block format, then switched to the LZ4 frame format to achieve compatibility with parquet-cpp (since both are used in the Python ecosystem): https://github.com/dask/fastparquet/issues/314
> > > >
> > > > Of course, fastparquet probably won't be able to read or write files compatible with parquet-mr...
> > > >
> > > > * The Rust Parquet implementation that's part of the Rust Arrow implementation is also having compatibility issues due to making the wrong guess: https://issues.apache.org/jira/browse/ARROW-11381
> > > >
> > > > * Impala seems to have had to change its LZ4 implementation to also match the undocumented parquet-mr framing format: https://issues.apache.org/jira/browse/IMPALA-8617
> > > >
> > > > (I have no idea whether this solved all LZ4 problems for Impala, or whether it broke compatibility with previous Impala versions.)
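> > > >
> > > > The "wrong guess" problem is easy to demonstrate: even the two officially defined LZ4 formats are not interchangeable with each other, let alone with the Hadoop framing. A small Java illustration using the lz4-java library, showing that the same input yields mutually unreadable byte streams:
> > > >
> > > >     import java.io.ByteArrayOutputStream;
> > > >     import java.nio.charset.StandardCharsets;
> > > >     import net.jpountz.lz4.LZ4Factory;
> > > >     import net.jpountz.lz4.LZ4FrameOutputStream;
> > > >
> > > >     class BlockVsFrame {
> > > >       public static void main(String[] args) throws Exception {
> > > >         byte[] data = "the same input bytes".getBytes(StandardCharsets.UTF_8);
> > > >
> > > >         // LZ4 block format: headerless; the decompressed size must be
> > > >         // carried on the side (e.g. in the Parquet page header).
> > > >         byte[] block = LZ4Factory.fastestInstance().fastCompressor().compress(data);
> > > >
> > > >         // LZ4 frame format: a self-describing container around blocks.
> > > >         ByteArrayOutputStream buf = new ByteArrayOutputStream();
> > > >         try (LZ4FrameOutputStream frame = new LZ4FrameOutputStream(buf)) {
> > > >           frame.write(data);
> > > >         }
> > > >
> > > >         // `block`, `buf.toByteArray()` and the Hadoop framing sketched
> > > >         // earlier are three incompatible encodings of the same data.
> > > >         System.out.println("block: " + block.length + " bytes, frame: " + buf.size() + " bytes");
> > > >       }
> > > >     }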
> > > >
> > > > Possible solutions
> > > > ------------------
> > > >
> > > > 1) Do nothing. The Parquet ecosystem stays forever fragmented, even though the official Parquet documentation doesn't say so.
> > > >
> > > > 2) parquet-cpp finally achieves compatibility through additional reverse-engineering efforts. While I have no authority to prevent this, I also have zero personal motivation to pursue it anymore. I suspect other parquet-cpp developers are not significantly more motivated than I am (but I may be wrong).
> > > >
> > > > 3) parquet-mr takes its share of the effort by striving to be compatible with files produced by (the current iteration of) parquet-cpp. I have no idea whether this is easily doable, or whether there is any motivation or workforce to do it. Therefore, I will conservatively rate this as "unlikely in the short term".
> > > >
> > > > 4) Admitting that LZ4 compatibility is currently not an achievable goal, the LZ4 compression option is deprecated and removed from the Parquet spec. Implementations display a warning message when it is being read. Writing LZ4 files is disabled.
> > > >
> > > > Discussion
> > > > ----------
> > > >
> > > > Solution #1 seems evidently undesirable to me (though it also seems that nobody on the Java side was tremendously impatient to solve this issue).
> > > >
> > > > As I said, solution #2 is rather unlikely.
> > > >
> > > > Java developers would have to evaluate how likely solution #3 is. We should avoid depending on promises that nobody is really willing to keep, though. Please only say "we're going to solve this" if you're really willing to push it through.
> > > >
> > > > In any case, even if solution #2 or #3 were to be implemented, it would not necessarily help future implementations, as long as the format is still unspecified.
> > > >
> > > > In the end, while a bit frustrating for everyone, solution #4 looks like the only reasonable one for the time being.
> > > >
> > > > Future LZ4 format
> > > > -----------------
> > > >
> > > > While we should deprecate the current LZ4 format, this doesn't preclude adding another, this time well-specified, LZ4 format in the future. It would simply have to use a new `CompressionCodec` enum value in the Thrift definition.
> > > >
> > > > Most certainly, it should be either of the two formats officially defined by the LZ4 project (the block format or the frame format).
> > > >
> > > > Regards
> > > >
> > > > Antoine.