To reiterate what I've already said on the GH issue page, I'm skeptical about the use case presented by the original submitter (parallel GZip compression of data pages).
1) GZip is technically obsolete compared to Zstd, LZ4, or even Snappy or Brotli.

2) Data pages are meant to be small (typically L1-cache sized), so splitting them into even smaller chunks for compression doesn't sound like a terrific strategy.

3) Systems using Parquet generally parallelize at a higher level already (for example at the row group or column chunk level), so they probably wouldn't gain much by also parallelizing data compression.

I wouldn't mind the proposed spec addition, but for now it is being driven by a single person pushing for it on GitHub, so the motivation seems rather weak.

Regards,

Antoine.

On Thu, 19 Oct 2023 10:24:57 +0100
Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Hi All,
>
> Recently it was reported that many of the arrow parquet readers,
> including arrow-cpp, pyarrow and arrow-rs, do not support GZIP
> compressed pages containing multiple members [3]. It would also appear
> that other parquet implementations such as DuckDB have similar issues [4].
> This in turn led to some discussion as to whether this is permissible
> according to the parquet specification [5], with the proposed compromise
> being to explicitly state that multiple members should be supported by
> readers, but to recommend that writers don't produce such pages by
> default, given the non-trivial install base where this will cause issues,
> including silent data corruption. I have tried to encode this in [6]
> and welcome any feedback.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow/pull/38272
> [2]: https://github.com/apache/arrow-rs/pull/4951
> [3]: https://datatracker.ietf.org/doc/html/rfc1952
> [4]: https://github.com/apache/parquet-testing/pull/41#issuecomment-1770410715
> [5]: https://github.com/apache/parquet-testing/pull/41
> [6]: https://github.com/apache/parquet-format/pull/218
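For anyone unfamiliar with the multi-member behaviour Raphael describes, here is a minimal Python sketch (standard library only, with illustrative payloads; none of the Parquet implementations mentioned above are involved) showing how a decoder that stops after the first gzip member silently drops the remainder of the stream, while a member-aware decoder recovers everything:

    import gzip
    import zlib

    # RFC 1952 allows a gzip stream to be a concatenation of independent
    # "members"; build one from two separately compressed chunks.
    # (The payload bytes are purely illustrative.)
    stream = gzip.compress(b"first member ") + gzip.compress(b"second member")

    # A member-aware reader recovers the full payload.
    assert gzip.decompress(stream) == b"first member second member"

    # A reader that decodes only the first member stops early and silently
    # drops the rest -- the corruption scenario described in the thread.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    first_only = d.decompress(stream)
    assert first_only == b"first member "
    assert d.unused_data.startswith(b"\x1f\x8b")  # second member left unread

The bytes left behind in unused_data are exactly the silently dropped data that the proposed spec wording is meant to guard against.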