Hello,
I am looking into enabling the C++ Arrow Parquet reading API to read ColumnChunks 
that live in different files from the file containing the metadata, in line with 
the original Parquet specification:

https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769

and I would like to get broader Parquet community feedback on the merits and 
potential pitfalls of supporting this feature.

On the merits front, support for this feature opens up a number of 
possibilities, many of them (I’m sure) having already served as motivation for 
adding this feature in the first place:

- Allows an entire data set (separately produced parquet files) to be 
handled seamlessly as a single unified parquet file
- Allows derived parquet files (that, say, add or remove a column or a row 
group relative to an existing file) to be generated without copying the common data
- Allows a parquet file to be written and read concurrently (by producing 
intermediary metadata files that point to the row groups already written in 
full)

This functionality is already supported by the fastparquet implementation:
https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187
and I am happy to assist with a Java implementation if there is interest.

I look forward to your insights.
Thank you
Radu
