Hi Wes,
On Wed, 15 May 2024 18:56:42 -0500
Wes McKinney <wesmck...@gmail.com> wrote:
> -- I am not sure how you fully make this problem go away in generality
> without doing away with Thrift at the footer level, but at that point
> you are making such a disruptive change that why not try to fix some
> other problems as well? If you go down that rabbit hole, you have
> created a new file format that is no longer Parquet, and so calling it
> ParquetV3 is probably misleading.

I agree that redesigning the metadata structure and encoding would
amount to a new format entirely.

> - Parquet's data page format has worked well over time, but aside from
> fixing the metadata overhead issue, the data page itself needs to be
> extensible. There is DATA_PAGE_V2, but structurally it is the same as
> DATA_PAGE{_V1} with the repetition and definition levels kept outside
> of the compressed portion. You can kind of think of Parquet's data
> page structure as one possible choice of options in a general purpose
> nested encoding scheme (most implementations do dictionary+rle and
> fall back on plain encoding when the dictionary exceeds a certain
> size). We could create a DATA_PAGE_V3 that allows for a whole
> alternate -- and even pluggable -- encoding scheme, without changing
> the metadata, and this would be valuable to the Parquet community,
> even if most mainstream Parquet users (e.g. Spark) will opt not to use
> it for a period of some years for compatibility reasons.

Do you mean allowing custom encodings, much like Arrow has extension
types? That would indeed make it possible to experiment with novel
encoding schemes and to solidify them gradually.

A closely related feature that would be useful is extension types in
Parquet itself, instead of having every logical type reified in the
Thrift definitions (the first sketch in the P.S. below shows what the
Arrow mechanism looks like). This was mentioned in the discussion for
https://github.com/apache/parquet-format/pull/240

> - Another problem that I haven't seen mentioned but maybe I just
> missed it is that Parquet is very painful to decode on accelerators
> like GPUs. RAPIDS has created a CUDA implementation of Parquet
> decoding (including decoding the Thrift data page headers on the
> GPU), but there are two primary problems 1) there is metadata that is
> necessary for control flow on the host side within the ColumnChunk in
> the row group and 2) there are not sufficient memory preallocation
> hints -- how much memory you need to allocate to fully decode a data
> page. This is also discussed in
> https://github.com/facebookincubator/nimble/discussions/50

The latest format additions should make this better (the second sketch
in the P.S. below shows how they could drive preallocation). It would
be good to hear from GPU people whether more metadata is needed:
https://github.com/apache/parquet-format/blob/079a2dff06e32b7d1ad8c9aa67f2e2128fb5ffa5/src/main/thrift/parquet.thrift#L194-L238

Regards

Antoine.
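
P.S. For concreteness, here is a minimal sketch of the Arrow extension
type mechanism as exposed by PyArrow. The Uuid class and the
"example.uuid" name are made up for illustration; the point is that the
storage type stays standard, so a reader that does not know the
extension still sees a plain fixed_size_binary column:

    import pyarrow as pa

    class Uuid(pa.ExtensionType):
        """Hypothetical extension type storing UUIDs as 16-byte binary."""

        def __init__(self):
            # Storage is a plain 16-byte fixed-size binary column; the
            # extension name travels alongside it in the schema metadata.
            super().__init__(pa.binary(16), "example.uuid")

        def __arrow_ext_serialize__(self):
            # This type has no parameters, so nothing to serialize.
            return b""

        @classmethod
        def __arrow_ext_deserialize__(cls, storage_type, serialized):
            return cls()

    # Registration lets PyArrow reconstruct the extension type when it
    # encounters the name on deserialization.
    pa.register_extension_type(Uuid())

An extension mechanism for Parquet logical types could follow the same
shape: a standard physical type plus an (extension name, opaque
parameters) pair in the metadata.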
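
And a rough sketch (again Python, not a real decoder) of how the new
statistics could give buffer sizes for a BYTE_ARRAY column chunk before
any page is decoded. I am assuming the link above points at the
SizeStatistics struct; the function and its Arrow-style output layout
are made up for illustration:

    def preallocation_bytes(unencoded_byte_array_data_bytes,
                            definition_level_histogram,
                            max_definition_level):
        # Entries at the maximum definition level are the non-null
        # leaf values.
        num_values = definition_level_histogram[max_definition_level]
        # Values buffer: exact unencoded size, straight from the
        # statistics.
        values_bytes = unencoded_byte_array_data_bytes
        # Offsets buffer (Arrow-style variable-length layout): one
        # 32-bit offset per value plus one terminator.
        offsets_bytes = 4 * (num_values + 1)
        return values_bytes + offsets_bytes

With something like that, a GPU decoder could allocate its output
buffers on the host before launching the decode kernels, instead of
sizing them from the decoded data.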