[
https://issues.apache.org/jira/browse/ARROW-17228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
ASF GitHub Bot updated ARROW-17228:
-----------------------------------
Labels: pull-request-available (was: )
> dataset.write_dataset should use Scanner.projected_schema when passed a scanner
> with projected columns
> ---------------------------------------------------------------------------------------------------
>
> Key: ARROW-17228
> URL: https://issues.apache.org/jira/browse/ARROW-17228
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 8.0.0
> Environment: Python 3.9.13
> pyarrow 8.0.0
> Reporter: &res
> Priority: Minor
> Labels: pull-request-available
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In the code below:
> {code:python}
> import pyarrow as pa
> import pyarrow.dataset as ds
>
> table = pa.Table.from_arrays(
>     [
>         pa.array(['a', 'b', 'c'], pa.string()),
>         pa.array(['a', 'b', 'c'], pa.string()),
>     ],
>     names=['region', 'Other']
> )
> table_dataset = ds.dataset(table)
> # Project 'region' to a renamed column 'Region'.
> columns = {
>     'Region': ds.field('region'),
>     'Other': ds.field('Other'),
> }
> scanner = table_dataset.scanner(columns=columns)
> # Partition on the projected (renamed) column.
> ds.write_dataset(
>     scanner,
>     'newpath',
>     partitioning=['Region'], partitioning_flavor='hive',
>     format='parquet')
> {code}
> I get this exception:
> {code:java}
> KeyError: 'Column Region does not exist in schema'
> {code}
> I suspect this is because write_dataset isn't looking at the correct schema.
> It should look at scanner.projected_schema (rather than
> scanner.dataset_schema).
> I think it's just a matter of updating this line:
> https://github.com/apache/arrow/blob/bc6c4988691cf60ecac67542b2daa2ac19fde5d9/python/pyarrow/dataset.py#L967
>
> The issue was raised here:
> https://stackoverflow.com/questions/73139467/how-to-incorporate-projected-columns-in-scanner-into-new-dataset-partitioning
>