To reiterate what I've already said on the GH issue page, I'm skeptical about the use case presented by the original submitter (parallel GZip compression of data pages).
1) GZip is technically obsolete compared to Zstd, LZ4, or even Snappy or Brotli.

2) Data pages are meant to be small (typically L1-cache sized), so splitting them into even smaller chunks for compression doesn't sound like a terrific strategy.

3) Systems using Parquet generally parallelize at a higher level already (for example at the row group or column chunk level), so they probably wouldn't gain much by also parallelizing data compression.

I wouldn't mind the proposed spec addition, but for now it is being driven by a single person pushing for it on GitHub, so the motivation seems rather weak.

Regards,

Antoine.

On Thu, 19 Oct 2023 10:24:57 +0100
Raphael Taylor-Davies <r.taylordav...@googlemail.com.INVALID> wrote:

> Hi All,
>
> Recently it was reported that many of the arrow parquet readers,
> including arrow-cpp, pyarrow and arrow-rs, do not support GZIP
> compressed pages containing multiple members [3]. It would also appear
> that other parquet implementations such as DuckDB have similar issues [4].
> This in turn led to some discussion as to whether this is permissible
> according to the parquet specification [5], with the proposed compromise
> being to explicitly state that multiple members should be supported by
> readers, but to recommend that writers don't produce such pages by
> default, given the non-trivial install base where this will cause issues,
> including silent data corruption. I have tried to encode this in [6]
> and welcome any feedback.
>
> Kind Regards,
>
> Raphael Taylor-Davies
>
> [1]: https://github.com/apache/arrow/pull/38272
> [2]: https://github.com/apache/arrow-rs/pull/4951
> [3]: https://datatracker.ietf.org/doc/html/rfc1952
> [4]: https://github.com/apache/parquet-testing/pull/41#issuecomment-1770410715
> [5]: https://github.com/apache/parquet-testing/pull/41
> [6]: https://github.com/apache/parquet-format/pull/218
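For anyone unfamiliar with the multi-member behaviour Raphael describes, here is a minimal Python sketch (standard library only, with illustrative payloads; none of the Parquet implementations mentioned above are involved) showing how a decoder that stops after the first gzip member silently drops the remainder of the stream, while a member-aware decoder recovers everything:

    import gzip
    import zlib

    # RFC 1952 allows a gzip stream to be a concatenation of independent
    # "members"; build one from two separately compressed chunks.
    # (The payload bytes are purely illustrative.)
    stream = gzip.compress(b"first member ") + gzip.compress(b"second member")

    # A member-aware reader recovers the full payload.
    assert gzip.decompress(stream) == b"first member second member"

    # A reader that decodes only the first member stops early and silently
    # drops the rest -- the corruption scenario described in the thread.
    d = zlib.decompressobj(wbits=16 + zlib.MAX_WBITS)  # expect a gzip wrapper
    first_only = d.decompress(stream)
    assert first_only == b"first member "
    assert d.unused_data.startswith(b"\x1f\x8b")  # second member left unread

The bytes left behind in unused_data are exactly the silently dropped data that the proposed spec wording is meant to guard against.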