+1 for this support.
Concurrent writing will also benefit wide Parquet schemas with
hundreds or thousands of columns.


On Thu, Sep 10, 2020 at 6:59 PM Radu Teodorescu
<[email protected]> wrote:

> Hello,
> I am looking into enabling the C++ Arrow Parquet reading API to read
> ColumnChunks that live in files other than the one containing the
> metadata, in line with the original Parquet specification:
>
>
> https://github.com/apache/parquet-format/blob/01971a532e20ff8e5eba9d440289bfb753f0cf0b/src/main/thrift/parquet.thrift#L769
>
> and I would like to get broader parquet community feedback with regards to
> the merits and potential pitfalls of supporting this feature.
>
> On the merits front, support for this feature opens up a number of
> possibilities, many of them (I’m sure) having already served as motivation
> for adding this feature in the first place:
>
> - Allows for an entire data set (separately produced parquet files) to be
> handled seamlessly as a unified parquet file
> - Allows derived parquet files (ones that, say, add or remove a column or a
> row group relative to an existing file) to be generated without copying the
> common data
> - Allows for a parquet file to be written and read concurrently (by producing
> intermediary metadata files that point to the row groups already written in
> full)
>
> This functionality is already supported by the fastparquet implementation
> (https://github.com/dask/fastparquet/blob/0402257560e20b961a517ee6d770e0995e944163/fastparquet/api.py#L187),
> and I am happy to assist with a Java implementation if there is interest.
>
> I am looking forward to all the exciting insights.
> Thank you
> Radu
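For readers new to this corner of the spec, the path resolution described above can be sketched conceptually. Below is a minimal stdlib-Python illustration; the dataclass names mirror the Thrift structs in parquet.thrift but are simplified stand-ins, not the real generated classes, and the relative-path rule follows the spec comment that an unset `file_path` means the chunk lives in the metadata file itself:

```python
import posixpath
from dataclasses import dataclass
from typing import List, Optional

# Conceptual stand-ins for the Thrift structs in parquet.thrift
# (ColumnChunk, RowGroup); these are NOT the real generated classes.
@dataclass
class ColumnChunk:
    file_path: Optional[str]  # None => chunk stored in the metadata file itself
    file_offset: int          # byte offset of the column chunk metadata

@dataclass
class RowGroup:
    columns: List[ColumnChunk]

def resolve_chunk_file(metadata_path: str, chunk: ColumnChunk) -> str:
    """Return the file a reader must open for this column chunk.

    Per the spec comment, file_path is interpreted relative to the
    file containing the metadata; when it is unset, the chunk is
    assumed to be stored in that same file.
    """
    if chunk.file_path is None:
        return metadata_path
    base = posixpath.dirname(metadata_path)
    return posixpath.normpath(posixpath.join(base, chunk.file_path))

# A standalone "_metadata" file whose row group points at chunks
# written separately (e.g. by concurrent writers):
rg = RowGroup(columns=[
    ColumnChunk(file_path="part-0.parquet", file_offset=4),
    ColumnChunk(file_path=None, file_offset=1024),
])
print(resolve_chunk_file("/data/set/_metadata", rg.columns[0]))  # /data/set/part-0.parquet
print(resolve_chunk_file("/data/set/_metadata", rg.columns[1]))  # /data/set/_metadata
```

This is the same pattern fastparquet uses for its `_metadata` summary files: one small footer-only file whose row groups reference column chunks spread across many data files.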



-- 
regards,
Deepak Majeti
