Agreed, but even then, if some Parquet files are generated inside a
well-defined system that only needs to be interoperable with itself,
it's not necessarily harmful to allow LZ4 compression when writing new files.
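
(For what it's worth, a minimal sketch of what that opt-in could look like
from the Python bindings, assuming pyarrow is built with Parquet and LZ4
support; the file name is made up:)

    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"id": list(range(1000)),
                      "value": [i * 0.5 for i in range(1000)]})

    # Opt in to LZ4 explicitly; a file written this way is only expected to
    # round-trip within the same Arrow-based stack.
    pq.write_table(table, "lz4_internal.parquet", compression="lz4")

    # Reading back with the same implementation works either way.
    assert pq.read_table("lz4_internal.parquet").equals(table)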

Regards

Antoine.


On 13/07/2020 at 17:07, Wes McKinney wrote:
> I didn’t say to disable _reading_ them, only writing them.
> 
> On Mon, Jul 13, 2020 at 4:15 AM Antoine Pitrou <anto...@python.org> wrote:
> 
>>
>> I'm not sure that's a good idea.  There are probably Parquet files that
>> are only ever used with the Arrow implementation (Arrow C++, Arrow
>> Python, Arrow R...).
>>
>> I admit I'm also not terribly bothered about this, since the Parquet
>> community itself doesn't seem to care much about the issue (it has been
>> known for a long time and they could have solved it long ago).
>>
>> Regards
>>
>> Antoine.
>>
>>
>> On 13/07/2020 at 00:11, Wes McKinney wrote:
>>> Since there hasn't been other movement on this, we need to disable
>>> writing LZ4-compressed files until this can be investigated more
>>> thoroughly. If someone wants to submit a patch, that would be helpful;
>>> otherwise I can take a look in the next couple of days.
>>>
>>> On Thu, Jul 2, 2020 at 12:50 PM Antoine Pitrou <anto...@python.org> wrote:
>>>>
>>>>
>>>> Well, it depends how important speed is, but LZ4 has extremely fast
>>>> decompression, even compared to Snappy:
>>>> https://github.com/lz4/lz4#benchmarks
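
(A rough way to sanity-check that claim locally; a sketch that assumes the
third-party lz4 and python-snappy packages, with an arbitrary payload and
repeat count:)

    import time
    import lz4.frame
    import snappy  # python-snappy

    # Arbitrary, moderately compressible payload; real Parquet pages will differ.
    payload = b"some moderately repetitive column data " * 100_000

    lz4_blob = lz4.frame.compress(payload)
    snappy_blob = snappy.compress(payload)

    def bench(label, decompress, n=50):
        start = time.perf_counter()
        for _ in range(n):
            decompress()
        elapsed = time.perf_counter() - start
        print("%s: %.2f GB/s" % (label, n * len(payload) / elapsed / 1e9))

    bench("lz4 frame decompress", lambda: lz4.frame.decompress(lz4_blob))
    bench("snappy decompress", lambda: snappy.uncompress(snappy_blob))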
>>>>
>>>> Regards
>>>>
>>>> Antoine.
>>>>
>>>>
>>>> On 02/07/2020 at 19:47, Christian Hudon wrote:
>>>>> At least for us, the advantages of Parquet are speed and interoperability
>>>>> in the context of longer-term data storage, so I would tend to say
>>>>> "reasonably conservative".
>>>>>
>>>>> On Wed., Jul. 1, 2020, at 09:32, Antoine Pitrou <solip...@pitrou.net> wrote:
>>>>>
>>>>>>
>>>>>> I don't have a sense of how conservative Parquet users generally are.
>>>>>> Is it worth adding an LZ4_FRAMED compression option to the Parquet
>>>>>> format, or would people just not use it?
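
(For context on what "framed" means here: the LZ4 frame format is
self-describing, while a bare LZ4 block is not, which is part of what makes
the current codec ambiguous between implementations. A small sketch,
assuming the third-party lz4 Python package:)

    import lz4.block
    import lz4.frame

    data = b"example page contents " * 1000

    bare_block = lz4.block.compress(data, store_size=False)  # no framing at all
    framed = lz4.frame.compress(data)                        # self-describing frame

    # The frame starts with the LZ4 magic number 0x184D2204 (little-endian on
    # disk); a bare block carries no header, so writer and reader must agree
    # on the framing out of band.
    print(framed[:4].hex())      # '04224d18'
    print(bare_block[:4].hex())  # arbitrary compressed bytes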
>>>>>>
>>>>>> Regards
>>>>>>
>>>>>> Antoine.
>>>>>>
>>>>>>
>>>>>> On Tue, 30 Jun 2020 14:33:17 +0200
>>>>>> "Uwe L. Korn" <uw...@xhochy.com> wrote:
>>>>>>> I'm also in favor of disabling support for now. Having to deal with
>>>>>>> broken files or the detection of various incompatible implementations
>>>>>>> in the long term will harm more than not supporting LZ4 for a while.
>>>>>>> Snappy is generally more used than LZ4 in this category, as it has been
>>>>>>> available since the inception of Parquet, and thus should be considered
>>>>>>> a viable alternative.
>>>>>>>
>>>>>>> Cheers
>>>>>>> Uwe
>>>>>>>
>>>>>>> On Mon, Jun 29, 2020, at 11:48 PM, Wes McKinney wrote:
>>>>>>>> On Thu, Jun 25, 2020 at 3:31 AM Antoine Pitrou <anto...@python.org> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On 25/06/2020 at 00:02, Wes McKinney wrote:
>>>>>>>>>> hi folks,
>>>>>>>>>>
>>>>>>>>>> (cross-posting to dev@arrow and dev@parquet since there are
>>>>>>>>>> stakeholders in both places)
>>>>>>>>>>
>>>>>>>>>> It seems there are still problems, at least with the C++
>>>>>>>>>> implementation of LZ4 compression in Parquet files:
>>>>>>>>>>
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1241
>>>>>>>>>> https://issues.apache.org/jira/browse/PARQUET-1878
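
(When triaging files like the ones in those reports, one way to see which
codec a file actually declares is to inspect the column-chunk metadata; a
sketch assuming pyarrow and a hypothetical file path:)

    import pyarrow.parquet as pq

    # Hypothetical path to a file whose declared codec we want to verify.
    meta = pq.ParquetFile("suspect.parquet").metadata

    for rg in range(meta.num_row_groups):
        for col in range(meta.num_columns):
            chunk = meta.row_group(rg).column(col)
            print(rg, chunk.path_in_schema, chunk.compression)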
>>>>>>>>>
>>>>>>>>> I don't have any particular opinion on how to solve the LZ4 issue, but
>>>>>>>>> I'd like to mention that LZ4 and ZStandard are the two most efficient
>>>>>>>>> compression algorithms available, and they span different parts of the
>>>>>>>>> speed/compression spectrum, so it would be a pity to disable one of them.
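
(The trade-off is easy to measure on one's own data; a small sketch assuming
a pyarrow build with snappy, lz4 and zstd enabled, comparing on-disk sizes
only, with made-up file names:)

    import os
    import pyarrow as pa
    import pyarrow.parquet as pq

    table = pa.table({"text": ["row %d payload" % i for i in range(100_000)]})

    # Relative sizes (and speeds) depend heavily on the column contents and
    # the codec level, so measure on representative data.
    for codec in ("snappy", "lz4", "zstd"):
        path = "sample_%s.parquet" % codec
        pq.write_table(table, path, compression=codec)
        print(codec, os.path.getsize(path), "bytes")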
>>>>>>>>
>>>>>>>> It's true; however, I think it's worse to write LZ4-compressed files
>>>>>>>> that cannot be read by other Parquet implementations (if that's what's
>>>>>>>> happening, as I understand it?). If we are indeed shipping something
>>>>>>>> broken, then we should either fix it or disable it until it can be
>>>>>>>> fixed.
>>>>>>>>
>>>>>>>>> Regards
>>>>>>>>>
>>>>>>>>> Antoine.
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>
> 
