The Parquet format specification is ambiguous about the exact details of
LZ4 compression. However, the *de facto* reference implementation in Java
(parquet-mr) uses the Hadoop LZ4 codec.

I think that it is important for Parquet C++ to have compatibility and
feature parity with parquet-mr when possible. I would prefer to change the
LZ4 implementation in Parquet C++ to match the Hadoop LZ4 implementation
that is used by parquet-mr
(https://issues.apache.org/jira/browse/PARQUET-1878). I think that this
change will be quick and easy. I have an intern under my supervision who is
available to work on it full time, starting immediately. Please let me know
if we ought to proceed.

If it is not feasible to achieve compatibility in the next release, then I
am in favor of disabling LZ4 support
(https://issues.apache.org/jira/browse/PARQUET-1515) until it can be fixed.

Thanks,
Steve


On Tue, 30 Jun 2020 14:33:17 +0200
"Uwe L. Korn" <uw...@xhochy.com> wrote:
> I'm also in favor of disabling support for now. Having to deal with
> broken files or the detection of various incompatible implementations in
> the long-term will harm more than not supporting LZ4 for a while. Snappy is
> generally more used than LZ4 in this category as it has been available
> since the inception of Parquet and thus should be considered as a viable
> alternative.
>
> Cheers
> Uwe
>
> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
> > On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <anto...@python.org> wrote:
> > >
> > >
> > > On 25/06/2020 at 00:02, Wes McKinney wrote:
> > > > hi folks,
> > > >
> > > > (cross-posting to dev@arrow and dev@parquet since there are
> > > > stakeholders in both places)
> > > >
> > > > It seems there are still problems at least with the C++
> > > > implementation of LZ4 compression in Parquet files
> > > >
> > > > https://issues.apache.org/jira/browse/PARQUET-1241
> > > > https://issues.apache.org/jira/browse/PARQUET-1878
> > >
> > > I don't have any particular opinion on how to solve the LZ4 issue, but
> > > I'd like to mention that LZ4 and ZStandard are the two most efficient
> > > compression algorithms available, and they span different parts of the
> > > speed/compression spectrum, so it would be a pity to disable one of
> > > them.
> >
> > It's true, however I think it's worse to write LZ4-compressed files
> > that cannot be read by other Parquet implementations (if that's what's
> > happening as I understand it?). If we are indeed shipping something
> > broken then we either should fix it or disable it until it can be
> > fixed.
> >
> > > Regards
> > >
> > > Antoine.
> >
>
