[ https://issues.apache.org/jira/browse/ARROW-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376714#comment-17376714 ]
Joris Van den Bossche commented on ARROW-13269:
-----------------------------------------------
It took me a while to understand the actual issue with partition columns not
being included in the metadata, so to make it concrete: we currently have an
example in the docs showing how to write a {{_metadata}} file using the
{{metadata_collector}} and {{write_metadata}} utilities
(https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-medata-files):
{code:python}
# Setup assumed by the docs example (the imports and the table/root_path
# definitions are not shown in the snippet on the docs page itself)
import pathlib

import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({'a': [1, 2, 3]})
root_path = pathlib.Path('example_dataset')

# Write a dataset and collect metadata information of all written files
metadata_collector = []
pq.write_to_dataset(table, root_path, metadata_collector=metadata_collector)

# Write the ``_common_metadata`` parquet file without row group statistics
pq.write_metadata(table.schema, root_path / '_common_metadata')

# Write the ``_metadata`` parquet file with row group statistics of all files
pq.write_metadata(
    table.schema, root_path / '_metadata',
    metadata_collector=metadata_collector
)
{code}
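For the non-partitioned case this works as expected. As a quick sanity check (a
minimal sketch reusing the {{root_path}} from above), the combined footer can
be read back with {{pq.read_metadata}}:
{code:python}
# Read the combined _metadata footer back; it should report the row
# groups of all data files written by write_to_dataset above.
meta = pq.read_metadata(root_path / '_metadata')
print(meta.num_row_groups, meta.num_rows)
{code}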
Now, this example doesn't actually use a partitioned dataset. And once you add
a {{partition_cols}} argument to the {{write_to_dataset}} call, the above
snippet fails (so it's not a very useful example ..).
For example:
{code:python}
table = pa.table({
    'Month': [5, 5, 5, 5, 6, 6, 6, 6],
    'Day': [1, 1, 2, 2, 1, 1, 2, 2],
    'Temp': range(8),
})
metadata_collector = []
pq.write_to_dataset(table, root_path, partition_cols=["Month", "Day"],
                    metadata_collector=metadata_collector)

# Write the ``_metadata`` parquet file with row group statistics of all files
pq.write_metadata(
    table.schema, root_path / '_metadata',
    metadata_collector=metadata_collector
)
{code}
gives:
{code}
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
<ipython-input-18-f3324e50d348> in <module>
      6
      7 # Write the ``_metadata`` parquet file with row groups statistics of all files
----> 8 pq.write_metadata(
      9     table.schema, root_path / '_metadata',
     10     metadata_collector=metadata_collector

~/scipy/repos/arrow/python/pyarrow/parquet.py in write_metadata(schema, where, metadata_collector, **kwargs)
   2082     if metadata_collector is not None:
   2083         # ParquetWriter doesn't expose the metadata until it's written. Write
-> 2084         # it and read it again.
   2085         metadata = read_metadata(where)
   2086         for m in metadata_collector:

~/scipy/repos/arrow/python/pyarrow/_parquet.pyx in pyarrow._parquet.FileMetaData.append_row_groups()

RuntimeError: AppendRowGroups requires equal schemas.
{code}
Indeed, the {{table.schema}} passed to {{write_metadata}} does not match the
schemas of the collected metadata (the data files are written without the
partition columns), and thus the row groups cannot be appended.
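To illustrate, here is a minimal sketch of a possible workaround, assuming the
example above ({{Table.drop}} and {{ParquetSchema.to_arrow_schema}} are
existing pyarrow APIs, but whether the resulting schemas compare exactly equal
may depend on the pyarrow version):
{code:python}
# The collected metadata describes the physical files, which do not
# contain the partition columns:
print(metadata_collector[0].schema.to_arrow_schema())
# -> only the 'Temp' field; 'Month' and 'Day' are missing

# Sketch of a workaround: drop the partition columns from the schema so
# that it matches the schemas collected from the written files.
pq.write_metadata(
    table.drop(["Month", "Day"]).schema, root_path / '_metadata',
    metadata_collector=metadata_collector
)
{code}
Of course, the resulting {{_metadata}} file then doesn't include the partition
fields either, which is exactly the problem this issue is about.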
> [C++] [Dataset] pyarrow.parquet.write_to_dataset does not send full schema to metadata_collector
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-13269
> URL: https://issues.apache.org/jira/browse/ARROW-13269
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 4.0.0
> Reporter: Weston Pace
> Priority: Major
>
> If there are partition columns specified, then the writers will only write the
> non-partition columns, and thus the written files will not contain the fields
> used for the partitioning.