Hello,

I am looking into enabling the C++ Arrow Parquet reading API to read ColumnChunks that live in files other than the one containing the footer metadata, in line with the original Parquet specification:
https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769

I would like to get broader Parquet community feedback on the merits and potential pitfalls of supporting this feature.

On the merits front, support for this feature opens up a number of possibilities, many of which (I'm sure) already served as motivation for adding it to the specification in the first place:

- Allows an entire data set (separately produced Parquet files) to be handled seamlessly as a single unified Parquet file
- Allows derived Parquet files (say, ones that add or remove a column or a row group relative to an existing file) to be generated without copying the common data
- Allows a Parquet file to be written and read concurrently, by producing intermediary metadata files that point to the row groups already written in full

This functionality is already supported by the fastparquet implementation:
https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187

I am happy to assist with a Java implementation if there is interest.

I am looking forward to all the exciting insights.

Thank you,
Radu