[
https://issues.apache.org/jira/browse/ARROW-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Joris Van den Bossche updated ARROW-15311:
------------------------------------------
Component/s: Python
> [Python][Docs] Opening a partitioned dataset with schema and filter
> -------------------------------------------------------------------
>
> Key: ARROW-15311
> URL: https://issues.apache.org/jira/browse/ARROW-15311
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Documentation, Python
> Reporter: Alenka Frim
> Priority: Major
> Labels: docs, python
>
> Add a note to the docs: when both a partitioning and a schema are specified
> when opening a dataset, and the partition field names are not stored in the
> data files themselves (directory or hive partitioning), the schema must also
> include the partition field names if filtering on them is intended.
> Example:
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> # Define the data
> table = pa.table({'one': [-1, np.nan, 2.5],
>                   'two': ['foo', 'bar', 'baz'],
>                   'three': [True, False, True]})
> # Write to partitioned dataset
> # The files will include columns "two" and "three"
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['one'])
> # Reading the partitioned dataset with a schema that does not include the
> # partition field names will raise an error when filtering on them:
> schema = pa.schema([("three", "double")])
> data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
> subset = ds.field("one") == 2.5
> data.to_table(filter=subset)
> # It will not error if the partition field is included in the schema:
> schema = pa.schema([("three", "double"), ("one", "double")])
> data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
> subset = ds.field("one") == 2.5
> data.to_table(filter=subset)
> {code}
--
This message was sent by Atlassian Jira
(v8.20.1#820001)