[
https://issues.apache.org/jira/browse/ARROW-13269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17376721#comment-17376721
]
Joris Van den Bossche commented on ARROW-13269:
-----------------------------------------------
So the question is: do we actually need to include the partition columns in the
metadata, or should we update the example to work with a partitioned dataset by
removing the partition columns from the {{table.schema}} passed to
{{write_metadata}}?
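The second option could be sketched as follows (a minimal, hypothetical illustration; the small {{table}} and the partition column names {{Month}}/{{Day}} are assumptions standing in for the example from the earlier comments):
{code:python}
import pyarrow as pa

# Stand-in for the table from the original example.
table = pa.table({
    "Temp": [20, 25],
    "Month": [1, 2],
    "Day": [3, 4],
})
partition_cols = ["Month", "Day"]

# Drop the partition columns before passing the schema to write_metadata,
# so it matches the schema the dataset writers actually produce.
schema = pa.schema(
    [f for f in table.schema if f.name not in partition_cols]
)
print(schema.names)  # -> ['Temp']
{code}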
Comparing with what dask does, using the above example:
{code:python}
import dask.dataframe as dd
ddf = dd.from_pandas(table.to_pandas(), npartitions=1)
root_path2 = pathlib.Path.cwd() / "test_metadata_dask"
ddf.to_parquet(root_path2, partition_on=["Month", "Day"], engine="pyarrow")
meta = pq.read_metadata(root_path2 / "_metadata")
{code}
Here, the metadata also doesn't include the partition columns:
{code}
>>> meta = pq.read_metadata(root_path2 / "_metadata")
>>> meta.schema
<pyarrow._parquet.ParquetSchema object at 0x7f1459573ac0>
required group field_id=-1 schema {
optional int64 field_id=-1 Temp;
optional int64 field_id=-1 __null_dask_index__;
}
{code}
> [C++] [Dataset] pyarrow.parquet.write_to_dataset does not send full schema to
> metadata_collector
> ------------------------------------------------------------------------------------------------
>
> Key: ARROW-13269
> URL: https://issues.apache.org/jira/browse/ARROW-13269
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++
> Affects Versions: 4.0.0
> Reporter: Weston Pace
> Priority: Major
>
> If partition columns are specified, the writers only write the non-partition
> columns, so the collected metadata will not contain the fields used for the
> partitioning.