+1 for this support. Concurrent writing will also benefit wide Parquet schemas with hundreds or thousands of columns.
On Thu, Sep 10, 2020 at 6:59 PM Radu Teodorescu <[email protected]> wrote:

> Hello,
> I am looking into enabling the C++ Arrow Parquet reading API to read
> ColumnChunks that are in different files than the file containing the
> metadata, in line with the original Parquet specification:
>
> https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769
>
> and I would like to get broader Parquet community feedback on the merits
> and potential pitfalls of supporting this feature.
>
> On the merits front, support for this feature opens up a number of
> possibilities, many of them (I'm sure) having already served as motivation
> for adding this feature in the first place:
>
> - Allows an entire data set (separately produced Parquet files) to be
> handled seamlessly as a unified Parquet file
> - Allows derived Parquet files (that, say, add or remove a column or a
> row group relative to an existing one) to be generated without copying
> the common data
> - Allows a Parquet file to be written and read concurrently (by producing
> intermediate metadata files that point to the row groups already written
> in full)
>
> This functionality is already supported by the fastparquet implementation:
> https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187
> and I am happy to assist with a Java implementation if there is interest.
>
> I am looking forward to all the exciting insights.
> Thank you
> Radu

--
regards,
Deepak Majeti
