Hi Rick,
Thanks for exploring this!
I am still quite new to Parquet myself, so the following might not be fully
correct, but based on my current understanding, to enable projects like
dask to write the different pieces of a Parquet dataset using pyarrow, we
need the following functionality:
- Write a single Parquet file (for one piece / partition) and get the
metadata of that file
   -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
getting the metadata
- Update and combine this list of metadata objects
   -> Dask needs a way to update the metadata (e.g. to set the exact file
path where a piece was put inside the partitioned dataset): I opened
ARROW-5349 for this.
-> We need to combine the metadata, discussed in ARROW-1983
- Write a metadata object (for both the _metadata and _common_metadata
files)
   -> Also discussed in ARROW-1983. The Python interface could also
combine (the step above) and write in a single call.
But it would be good if some people more familiar with Parquet could chime
in here.
Best,
Joris
On Thu, 16 May 2019 at 16:37, Richard Zamora <[email protected]> wrote:
> Note that I was asked to post here after making a similar comment on
> GitHub (https://github.com/apache/arrow/pull/4236)…
>
> I am hoping to help improve the use of pyarrow.parquet within dask (
> https://github.com/dask/dask). To this end, I put together a simple
> notebook to explore how pyarrow.parquet can be used to read/write a
> partitioned dataset without dask (see:
> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
> If you search for "Assuming that a single-file metadata solution is
> currently missing" in that notebook, you will see where I am unsure of the
> best way to write/read metadata to/from a centralized location using
> pyarrow.parquet.
>
> I believe that it would be best for dask to have a way to read/write a
> single metadata file for a partitioned dataset using pyarrow (perhaps a
> ‘_metadata’ file?). Am I correct to assume that: (1) this functionality
> is missing in pyarrow, and (2) this approach is the best way to process a
> partitioned dataset in parallel?
>
> Best,
> Rick
>
> --
> Richard J. Zamora
> NVIDIA
>