[ https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Will Jones reassigned ARROW-17349:
----------------------------------

    Assignee: Will Jones

> [C++] Support casting field names of list and map when nested
> -------------------------------------------------------------
>
>                 Key: ARROW-17349
>                 URL: https://issues.apache.org/jira/browse/ARROW-17349
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Will Jones
>            Assignee: Will Jones
>            Priority: Major
>              Labels: good-first-issue, kernel, pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Different Parquet implementations use different names for the internal 
> fields of ListType and MapType, which can cause needless conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can already cast a List or Map type to the 
> same type with different field names. Unfortunately, that cast only works 
> at the top level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
>     table = pa.table({"arr": in_arr})
>     pq.write_table(table, "test.parquet")
>     schema = pa.schema({"arr": out_type})
>     ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(
>     pa.string(),
>     pa.field("x", pa.list_(pa.field("x", pa.int32(), nullable=True)), nullable=False),
> )
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "<stdin>", line 1, in <module>
> #   File "<stdin>", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>
> {code}
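
The requested behavior amounts to comparing two types recursively while ignoring the names of the internal list and map fields. The sketch below is a toy model of that idea in plain Python, not Arrow's implementation: the classes Primitive, ListOf and MapOf and the function same_shape are hypothetical names invented for illustration, and nullability is deliberately left out.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for Arrow data types; none of these exist in pyarrow.
@dataclass(frozen=True)
class Primitive:
    name: str


@dataclass(frozen=True)
class ListOf:
    field_name: str      # e.g. "item" (Arrow) or "element" (Spark)
    value: "DataType"


@dataclass(frozen=True)
class MapOf:
    key_name: str
    key: "DataType"
    item_name: str
    item: "DataType"


DataType = Union[Primitive, ListOf, MapOf]


def same_shape(a: DataType, b: DataType) -> bool:
    """Return True if a and b match once internal field names are ignored.

    The recursion descends into list values and map keys/items, so a name
    mismatch at any nesting depth does not affect the result.
    """
    if isinstance(a, Primitive) and isinstance(b, Primitive):
        return a.name == b.name
    if isinstance(a, ListOf) and isinstance(b, ListOf):
        return same_shape(a.value, b.value)
    if isinstance(a, MapOf) and isinstance(b, MapOf):
        return same_shape(a.key, b.key) and same_shape(a.item, b.item)
    return False


int32 = Primitive("int32")
string = Primitive("string")

# Mirrors the failing case above: map<string, list<x: int32>> versus
# map<string, list<item: int32>>.
named = MapOf("key", string, "x", ListOf("x", int32))
default = MapOf("key", string, "value", ListOf("item", int32))
```

In this model, named and default are unequal under ordinary equality because their field names differ, yet same_shape(named, default) accepts them. A nested-aware cast would presumably apply the same recursion when deciding whether a cast between the two types is allowed.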



--
This message was sent by Atlassian Jira
(v8.20.10#820010)