[ 
https://issues.apache.org/jira/browse/ARROW-15311?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joris Van den Bossche updated ARROW-15311:
------------------------------------------
    Component/s: Python

> [Python][Docs] Opening a partitioned dataset with schema and filter
> -------------------------------------------------------------------
>
>                 Key: ARROW-15311
>                 URL: https://issues.apache.org/jira/browse/ARROW-15311
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Documentation, Python
>            Reporter: Alenka Frim
>            Priority: Major
>              Labels: docs, python
>
> Add a note to the docs: if both partitioning and schema are specified when 
> opening a dataset, and the partition field names are not stored in the data 
> files, the schema must also include the partition fields (directory or hive 
> partitioning) if a filter on those fields will be applied.
> Example:
> {code:python}
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> # Define the data
> table = pa.table({'one': [-1, np.nan, 2.5],
>                    'two': ['foo', 'bar', 'baz'],
>                    'three': [True, False, True]})
> # Write to partitioned dataset
> # The files will include columns "two" and "three"
> pq.write_to_dataset(table, root_path='dataset_name',
>                     partition_cols=['one'])
> # Reading the partitioned dataset with a schema that omits the partition
> # field "one" will error when filtering on that field
> schema = pa.schema([("three", pa.bool_())])
> data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
> subset = ds.field("one") == 2.5
> data.to_table(filter=subset)
> # It will not error if the schema also lists the partition field:
> schema = pa.schema([("three", pa.bool_()), ("one", pa.float64())])
> data = ds.dataset("dataset_name", partitioning="hive", schema=schema)
> subset = ds.field("one") == 2.5
> data.to_table(filter=subset)
> {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
