[ https://issues.apache.org/jira/browse/ARROW-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376721#comment-17376721 ]

Joris Van den Bossche commented on ARROW-13269:
-----------------------------------------------

So the question is: do we actually need to include the partition columns in 
the metadata, or should we update the example to work with a partitioned 
dataset by removing the partition columns from the {{table.schema}} passed to 
{{write_metadata}}?
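
For reference, a minimal sketch of that second option ({{table}}, {{root_path}} and {{metadata_collector}} are assumed from the example earlier in this thread):

{code:python}
import pyarrow.parquet as pq

# Drop the partition columns from the schema before writing _metadata,
# so that it matches what the individual file writers actually wrote.
partition_cols = ["Month", "Day"]
metadata_schema = table.schema
for name in partition_cols:
    metadata_schema = metadata_schema.remove(
        metadata_schema.get_field_index(name)
    )

pq.write_metadata(
    metadata_schema,
    root_path / "_metadata",
    metadata_collector=metadata_collector,
)
{code}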

Comparing with what dask does, using the above example:

{code:python}
import pathlib

import dask.dataframe as dd
import pyarrow.parquet as pq

# `table` is the pyarrow Table from the example above
ddf = dd.from_pandas(table.to_pandas(), npartitions=1)

root_path2 = pathlib.Path.cwd() / "test_metadata_dask"
ddf.to_parquet(root_path2, partition_on=["Month", "Day"], engine="pyarrow")
meta = pq.read_metadata(root_path2 / "_metadata")
{code}

Here, the metadata also doesn't include the partition columns:

{code}
>>> meta = pq.read_metadata(root_path2 / "_metadata")
>>> meta.schema
<pyarrow._parquet.ParquetSchema object at 0x7f1459573ac0>
required group field_id=-1 schema {
  optional int64 field_id=-1 Temp;
  optional int64 field_id=-1 __null_dask_index__;
}
{code}
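
This is consistent with how the dataset is read back: the partition columns are reconstructed from the hive-style directory names rather than from the {{_metadata}} schema. A minimal sketch, reusing {{root_path2}} from above:

{code:python}
import pyarrow.dataset as ds

# The partition columns (Month, Day) come back from the directory
# structure, not from the _metadata file's schema.
dataset = ds.dataset(root_path2, format="parquet", partitioning="hive")
print(dataset.schema)  # includes Month and Day alongside Temp
{code}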

> [C++] [Dataset] pyarrow.parquet.write_to_dataset does not send full schema to 
> metadata_collector
> ------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-13269
>                 URL: https://issues.apache.org/jira/browse/ARROW-13269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>    Affects Versions: 4.0.0
>            Reporter: Weston Pace
>            Priority: Major
>
> If partition columns are specified, then the writers will only write the 
> non-partition columns, and thus the collected file metadata will not contain 
> the fields used for the partition.


