[jira] [Commented] (ARROW-14596) [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False

Alenka Frim (Jira) Fri, 22 Apr 2022 07:00:16 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17526407#comment-17526407
 ]


Alenka Frim commented on ARROW-14596:
-------------------------------------

I would like to add observations we got today when pairing with 
[~jorisvandenbossche] on this topic.

First was the result of using {{pq.read_table}} with legacy implementation vs 
using {{ds.dataset}} with column projection. The data can get selected 
correctly with the dataset implementation but what happens is that the 
structure of a nested field is not kept (from struct it is flattened to string 
column). In case of using columns selection with a list in  {{{}ds.dataset{}}}, 
it errors, as reported in the issue.
{code:python}
>>> import pandas as pd
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> 
>>> df = pd.DataFrame({
...     'user_id': ['abc123', 'qrs456'],
...     'interaction': [{'type': 'click', 'element': 'button'}, 
{'type':'scroll', 'element': 'window'}]
... })
>>> 
>>> table = pa.Table.from_pandas(df)
>>> pq.write_table(table, 'example.parquet')
{code}
{code:python}
>>> pq.read_table('example.parquet', columns = ['user_id', 'interaction.type'], 
>>> use_legacy_dataset = True)
pyarrow.Table
user_id: string
interaction: struct<type: string>
  child 0, type: string
----
user_id: [["abc123","qrs456"]]
interaction: [
  -- is_valid: all not null
  -- child 0 type: string
["click","scroll"]]
{code}
{code:python}
>>> import pyarrow.dataset as ds
>>> projection = {
...         'user_id': ds.field('user_id'),
...         'new': ds.field(('interaction', 'type'))
...     }
>>> ds.dataset('example.parquet').to_table(columns=projection)
pyarrow.Table
user_id: string
new: string
----
user_id: [["abc123","qrs456"]]
new: [["click","scroll"]]
{code}
{code:python}
>>> ds.dataset('example.parquet').to_table(columns=['user_id', 
>>> 'interaction.type'])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pyarrow/_dataset.pyx", line 303, in pyarrow._dataset.Dataset.to_table
    return self.scanner(**kwargs).to_table()
  File "pyarrow/_dataset.pyx", line 270, in pyarrow._dataset.Dataset.scanner
    return Scanner.from_dataset(self, **kwargs)
  File "pyarrow/_dataset.pyx", line 2322, in 
pyarrow._dataset.Scanner.from_dataset
    _populate_builder(builder, columns=columns, filter=filter,
  File "pyarrow/_dataset.pyx", line 2168, in pyarrow._dataset._populate_builder
    check_status(builder.ProjectColumns([tobytes(c) for c in columns]))
  File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
    raise ArrowInvalid(message)
pyarrow.lib.ArrowInvalid: No match for FieldRef.Name(interaction.type) in 
user_id: string
interaction: struct<element: string, type: string>
__fragment_index: int32
__batch_index: int32
__last_in_fragment: bool
__filename: string
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1722  CheckNonEmpty(matches, 
root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/type.h:1757  FindOne(root)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:714  
ref->GetOne(dataset_schema)
/Users/alenkafrim/repos/arrow/cpp/src/arrow/dataset/scanner.cc:784  
ProjectionDescr::FromNames(std::move(columns), *scan_options_->dataset_schema)
{code}
When Scanner object is being created from the dataset class via {{to_table}} 
and (through _populate_builder) and in the case of a list of columns the 
{{ProjectColumns}} method ("arrow::dataset::ScannerBuilder") is being called it 
only accepts string column names and errors when a column is a struct.

We were thinking if it would be a good idea to add a new method in 
{{scanner.cc}} that would mimic {{FromNames}} method but takes {{field_ref}} as 
an argument? Afterwords there would also be a need to recreate a struct field 
for which we are not sure how to approach.

cc [~westonpace] [~apitrou] do you think that would be a correct way to go?

> [Python] parquet.read_table nested fields in columns does not work for 
> use_legacy_dataset=False
> -----------------------------------------------------------------------------------------------
>
>                 Key: ARROW-14596
>                 URL: https://issues.apache.org/jira/browse/ARROW-14596
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Tom Scheffers
>            Assignee: Alenka Frim
>            Priority: Critical
>             Fix For: 9.0.0
>
>
> Reading nested field does not work with use_legacy_dataset=False.
> This works:
>  
> {code:java}
> import pyarrow.parquet as pq
> t = pq.read_table(
>  source=*filename*,
>  columns=['store_key', 'properties.country'], 
>  use_legacy_dataset=True,
> ).to_pandas()
> {code}
> This does not work (for the same parquet file):
>  
> {code:java}
> import pyarrow.parquet as pq
> t = pq.read_table(
>  source=*filename*,
>  columns=['store_key', 'properties.country'], 
>  use_legacy_dataset=False,
> ).to_pandas(){code}
>  



--
This message was sent by Atlassian Jira
(v8.20.7#820007)

[jira] [Commented] (ARROW-14596) [Python] parquet.read_table nested fields in columns does not work for use_legacy_dataset=False

Reply via email to