I missed Wes's email, but yes, I think we basically said the same thing.

Answer to another question you raised in the notebook:

> [about writing a _common_metadata file] ... uses the schema object for
> the 0th partition. This actually means that not *all* information in
> _common_metadata will be true for the entire dataset. More specifically,
> the "index_columns" [in the pandas_metadata] its "start" and "stop"
> values will correspond to the 0th partition, rather than the global dataset.
>
That is indeed a problem with storing the index information in the metadata rather than as an actual column.
We have seen some other related issues about this, such as ARROW-5138 (when
reading a single row group of a parquet file).
In those cases, I think the only solution is to ignore this part of the
metadata. But specifically for dask, I think the idea is actually to not
write the index at all (based on the discussion in
https://github.com/dask/dask/pull/4336), so you would not have this
problem there.
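
For illustration, not writing the index would roughly look like this on the
pyarrow side (the file name here is just an example):

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq

    df = pd.DataFrame({"a": [1, 2, 3]})

    # Drop the (Range)Index instead of storing it in the pandas metadata,
    # so the per-partition "start"/"stop" values never end up in the
    # file-level metadata
    table = pa.Table.from_pandas(df, preserve_index=False)
    pq.write_table(table, "part.0.parquet")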

However, note that writing the _common_metadata file like that, from the
schema of the first partition, might not be fully correct: it will have the
correct schema, but not the correct dataset size (e.g. the number of row
groups). That said, I am not sure what the common practice is on this aspect
of the _common_metadata file.
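
For reference, a minimal sketch of that approach (the file paths here are
made up); pyarrow.parquet.write_metadata writes a metadata-only file from a
schema, so it records the schema but not the row groups of the other pieces:

    import pyarrow.parquet as pq

    # Schema of the 0th piece; this is what carries the pandas metadata,
    # including the RangeIndex start/stop that is only valid for that piece
    first_piece = "dataset_dir/part.0.parquet"
    schema = pq.ParquetFile(first_piece).schema.to_arrow_schema()

    # Write a metadata-only Parquet file next to the data files
    pq.write_metadata(schema, "dataset_dir/_common_metadata")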

Joris



On Thu, 16 May 2019 at 20:50, Joris Van den Bossche <
jorisvandenboss...@gmail.com> wrote:

> Hi Rick,
>
> Thanks for exploring this!
>
> I am still quite new to Parquet myself, so the following might not be
> fully correct, but based on my current understanding, to enable projects
> like dask to write the different pieces of a Parquet dataset using pyarrow,
> we need the following functionalities:
>
> - Write a single Parquet file (for one piece / partition) and get the
> metadata of that file
>     -> Writing has long been possible, and ARROW-5258 (GH4236) enabled
> getting the metadata
> - Update and combine this list of metadata objects
>     -> Dask needs a way to update the metadata (e.g. the exact file path
> where the piece is placed inside the partitioned dataset): I opened
> ARROW-5349 for this.
>     -> We need to combine the metadata, discussed in ARROW-1983
> - Write a metadata object (for both the _metadata and _common_metadata
> files)
>     -> Also discussed in ARROW-1983. The Python interface could also
> combine (the step above) and write in one go.
>
> But it would be good if some people more familiar with Parquet could chime
> in here.
>
> Best,
> Joris
>
> On Thu, 16 May 2019 at 16:37, Richard Zamora <rzam...@nvidia.com> wrote:
>
>> Note that I was asked to post here after making a similar comment on
>> GitHub (https://github.com/apache/arrow/pull/4236)…
>>
>> I am hoping to help improve the use of pyarrow.parquet within dask (
>> https://github.com/dask/dask). To this end, I put together a simple
>> notebook to explore how pyarrow.parquet can be used to read/write a
>> partitioned dataset without dask (see:
>> https://github.com/rjzamora/notebooks/blob/master/pandas_pyarrow_simple.ipynb).
>> If you search for "Assuming that a single-file metadata solution is
>> currently missing" in that notebook, you will see where I am unsure of the
>> best way to write/read metadata to/from a centralized location using
>> pyarrow.parquet.
>>
>> I believe that it would be best for dask to have a way to read/write a
>> single metadata file for a partitioned dataset using pyarrow (perhaps a
>> ‘_metadata’ file?). Am I correct to assume that (1) this functionality
>> is missing in pyarrow, and (2) this approach is the best way to process a
>> partitioned dataset in parallel?
>>
>> Best,
>> Rick
>>
>> --
>> Richard J. Zamora
>> NVIDIA
>>
>>
>
