[
https://issues.apache.org/jira/browse/ARROW-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526407#comment-17526407
]
Alenka Frim commented on ARROW-14596:
-------------------------------------
I would like to add observations we got today when pairing with
[~jorisvandenbossche] on this topic.
First was the result of using {{pq.read_table}} with legacy implementation vs
using {{ds.dataset}} with column projection. The data can get selected
correctly with the dataset implementation but what happens is that the
structure of a nested field is not kept (from struct it is flattened to string
column). In case of using columns selection with a list in {{{}ds.dataset{}}},
it errors, as reported in the issue.
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>>
>>> df = pd.DataFrame({
... 'user_id': ['abc123', 'qrs456'],
... 'interaction': [{'type': 'click', 'element': 'button'},
{'type':'scroll', 'element': 'window'}]
... })
>>>
>>> table = pa.Table.from_pandas(df)
>>> pq.write_table(table, 'example.parquet')
{code}
{code:python}
>>> pq.read_table('example.parquet', columns = ['user_id', 'interaction.type'],
>>> use_legacy_dataset = True)
pyarrow.Table
user_id: string
interaction: struct<type: string>
child 0, type: string
----
user_id: [["abc123","qrs456"]]
interaction: [
-- is_valid: all not null
-- child 0 type: string
["click","scroll"]]
{code}
{code:python}
>>> import pyarrow.dataset as ds
>>> projection = {
... 'user_id': ds.field('user_id'),
... 'new': ds.field(('interaction', 'type'))
... }
>>> ds.dataset('example.parquet').to_table(columns=projection)
pyarrow.Table
user_id: string
new: string
----
user_id: [["abc123","qrs456"]]
new: [["click","scroll"]]
{code}
{code:python}
>>> ds.dataset('example.parquet').to_table(columns=['user_id',
>>> 'interaction.type'])
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "pyarrow/_dataset.pyx", line 303, in pyarrow._dataset.Dataset.to_table
return self.scanner(**kwargs).to_table()
File "pyarrow/_dataset.pyx", line 270, in pyarrow._dataset.Dataset.scanner
return Scanner.from_dataset(self, **kwargs)
File "pyarrow/_dataset.pyx", line 2322, in
pyarrow._dataset.Scanner.from_dataset
_populate_builder(builder, columns=columns, filter=filter,
File "pyarrow/_dataset.pyx", line 2168, in pyarrow._dataset._populate_builder
check_status(builder.ProjectColumns([tobytes(c) for c in columns]))
File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(interaction.type) in
user_id: string
interaction: struct<element: string, type: string>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1722 CheckNonEmpty(matches,
root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1757 FindOne(root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:714
ref->GetOne(dataset_schema)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:784
ProjectionDescr::FromNames(std::move(columns), *scan_options_->dataset_schema)
{code}
When Scanner object is being created from the dataset class via {{to_table}}
and (through _populate_builder) and in the case of a list of columns the
{{ProjectColumns}} method ("arrow::dataset::ScannerBuilder") is being called it
only accepts string column names and errors when a column is a struct.
We were thinking if it would be a good idea to add a new method in
{{scanner.cc}} that would mimic {{FromNames}} method but takes {{field_ref}} as
an argument? Afterwords there would also be a need to recreate a struct field
for which we are not sure how to approach.
cc [~westonpace] [~apitrou] do you think that would be a correct way to go?
> [Python] parquet.read_table nested fields in columns does not work for
> use_legacy_dataset=False
> -----------------------------------------------------------------------------------------------
>
> Key: ARROW-14596
> URL: https://issues.apache.org/jira/browse/ARROW-14596
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Tom Scheffers
> Assignee: Alenka Frim
> Priority: Critical
> Fix For: 9.0.0
>
>
> Reading nested field does not work with use_legacy_dataset=False.
> This works:
>
> {code:java}
> import pyarrow.parquet as pq
> t = pq.read_table(
> source=*filename*,
> columns=['store_key', 'properties.country'],
> use_legacy_dataset=True,
> ).to_pandas()
> {code}
> This does not work (for the same parquet file):
>
> {code:java}
> import pyarrow.parquet as pq
> t = pq.read_table(
> source=*filename*,
> columns=['store_key', 'properties.country'],
> use_legacy_dataset=False,
> ).to_pandas(){code}
>
--
This message was sent by Atlassian Jira
(v8.20.7#820007)