Lei (Eddy) Xu created ARROW-17540:
-------------------------------------
Summary: [Python] Can not refer to field in a list of structs
Key: ARROW-17540
URL: https://issues.apache.org/jira/browse/ARROW-17540
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 9.0.0
Reporter: Lei (Eddy) Xu
When a dataset contains structs nested inside a list ("list<struct>"), we cannot
use `pyarrow.field(..)` to reference a sub-field of the struct.
For example:
{code:python}
import pyarrow as pa
import pyarrow.dataset as ds
import pandas as pd
schema = pa.schema(
    [
        pa.field(
            "objects",
            pa.list_(
                pa.struct(
                    [
                        pa.field("name", pa.utf8()),
                        pa.field("attr1", pa.float32()),
                        pa.field("attr2", pa.int32()),
                    ]
                )
            ),
        )
    ]
)
table = pa.Table.from_pandas(
    pd.DataFrame([{"objects": [{"name": "a", "attr1": 5.0, "attr2": 20}]}])
)
print(table)
dataset = ds.dataset(table)
print(dataset)
dataset.scanner(columns=["objects.attr2"]).to_table()
{code}
which throws an exception:
{noformat}
Traceback (most recent call last):
  File "foo.py", line 31, in <module>
    dataset.scanner(columns=["objects.attr2"]).to_table()
  File "pyarrow/_dataset.pyx", line 298, in pyarrow._dataset.Dataset.scanner
  File "pyarrow/_dataset.pyx", line 2356, in pyarrow._dataset.Scanner.from_dataset
  File "pyarrow/_dataset.pyx", line 2202, in pyarrow._dataset._populate_builder
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(objects.attr2) in objects: list<item: struct<attr1: double, attr2: int64, name: string>>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
{noformat}
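For comparison, the nested field can be reached once the column is in memory. The snippet below is only a minimal sketch, not a fix for the scanner: it assumes the data fits in memory, builds the table directly with `pa.table` instead of pandas to stay self-contained, and uses `ListArray.flatten()` followed by `StructArray.field()`. Note that flattening drops the list boundaries, so it is not equivalent to a projected `list<struct<attr2>>` column.
{code:python}
import pyarrow as pa

# Same shape of data as in the report, built without pandas (an assumption
# made to keep this sketch self-contained).
table = pa.table({"objects": [[{"name": "a", "attr1": 5.0, "attr2": 20}]]})

# combine_chunks() turns the ChunkedArray column into a single ListArray.
objects = table["objects"].combine_chunks()

# flatten() returns the StructArray holding all list items back to back.
structs = objects.flatten()

# field() selects one child of the struct by name.
attr2 = structs.field("attr2")
print(attr2.to_pylist())
{code}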