The Parquet format specification is ambiguous about the exact details of LZ4 compression. However, the *de facto* reference implementation in Java (parquet-mr) uses the Hadoop LZ4 codec.

I think that it is important for Parquet C++ to have compatibility and feature parity with parquet-mr when possible. I would prefer to change the LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation that is used by parquet-mr (https://issues.apache.org/jira/browse/PARQUET-1878). I think that this change will be quick and easy. I have an intern under my supervision who is available to work on it full time, starting immediately. Please let me know if we ought to proceed.

If it is not feasible to achieve compatibility in the next release, then I am in favor of disabling LZ4 support (https://issues.apache.org/jira/browse/PARQUET-1515) until it can be fixed.

Thanks,
Steve

On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn" <uw...@xhochy.com> wrote:

> I'm also in favor of disabling support for now. Having to deal with broken files or the detection of various incompatible implementations in the long-term will harm more than not supporting LZ4 for a while. Snappy is generally more used than LZ4 in this category as it has been available since the inception of Parquet and thus should be considered as a viable alternative.
>
> Cheers
> Uwe
>
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++ implementation
> > > > of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of them.
> >
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >
> > > Regards
> > >
> > > Antoine.
> > >