[ 
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17607422#comment-17607422
 ] 

Will Jones commented on ARROW-17349:
------------------------------------

What's actually going on is that we don't have a cast kernel for Map at all. 
Casting from one map type to another only appears to work because cast 
returns early when the types are equal, and our equals method ignores map 
field names. It does compare list field names, though, so if the map contains 
a list whose field name differs, the types are no longer equal and cast goes 
looking for a map cast function that doesn't exist.
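
To make that concrete, here is a toy model of the behaviour in plain Python. 
It is only a sketch: Field, DataType, list_, map_, types_equal, CAST_KERNELS 
and cast are hypothetical stand-ins written for this comment, not pyarrow or 
Arrow C++ APIs, and nullability is left out entirely.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Arrow types; none of this is pyarrow API.
@dataclass(frozen=True)
class Field:
    name: str
    type: "DataType"


@dataclass(frozen=True)
class DataType:
    kind: str           # "int32", "string", "list" or "map"
    fields: tuple = ()  # child fields; empty for primitive types


int32 = DataType("int32")
string = DataType("string")


def list_(value_field):
    return DataType("list", (value_field,))


def map_(key_field, item_field):
    return DataType("map", (key_field, item_field))


def types_equal(a, b):
    """Equality as described above: map field names are ignored,
    list field names are compared."""
    if a.kind != b.kind or len(a.fields) != len(b.fields):
        return False
    for fa, fb in zip(a.fields, b.fields):
        if a.kind == "list" and fa.name != fb.name:
            return False
        if not types_equal(fa.type, fb.type):
            return False
    return True


# Assumption for the model: a list cast kernel exists, a map one does not.
CAST_KERNELS = {"list"}


def cast(from_type, to_type):
    if types_equal(from_type, to_type):
        return "early return"
    if to_type.kind in CAST_KERNELS:
        return "cast kernel"
    raise NotImplementedError(f"Unsupported cast to {to_type.kind}")


# Map with a renamed key field: still "equal", so cast returns early.
m_named = map_(Field("x", int32), Field("value", int32))
m_plain = map_(Field("key", int32), Field("value", int32))
print(cast(m_named, m_plain))

# Map containing a list with a renamed field: not equal, and no map kernel.
m_nested_named = map_(Field("key", string), Field("x", list_(Field("x", int32))))
m_nested_plain = map_(Field("key", string), Field("value", list_(Field("item", int32))))
try:
    cast(m_nested_named, m_nested_plain)
except NotImplementedError as exc:
    print(exc)
```

In this model the top-level map rename succeeds only because of the equality 
shortcut, which is the same reason the nested case falls through to the 
missing kernel.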

I'll create a separate ticket for implementing Cast for Map, but for this 
particular issue, I think it would be nice to have a fast path for renaming 
fields in cast.
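
Roughly, the fast path I have in mind could look like the sketch below. It 
uses a toy type model in plain Python; Field, DataType, same_layout and 
cast_with_fast_path are hypothetical names, not existing Arrow functions. A 
real implementation would also have to check nullability and would reuse the 
array's buffers instead of just returning a type.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for Arrow types; none of this is pyarrow API.
@dataclass(frozen=True)
class Field:
    name: str
    type: "DataType"


@dataclass(frozen=True)
class DataType:
    kind: str           # "int32", "string", "list" or "map"
    fields: tuple = ()  # child fields; empty for primitive types


def same_layout(a, b):
    """True when a and b differ at most in the names of their child fields,
    at any level of nesting."""
    return (
        a.kind == b.kind
        and len(a.fields) == len(b.fields)
        and all(same_layout(fa.type, fb.type) for fa, fb in zip(a.fields, b.fields))
    )


def cast_with_fast_path(from_type, to_type):
    if same_layout(from_type, to_type):
        # Fast path: only field names change, so the data could be kept
        # as-is and simply relabelled with the target type.
        return to_type
    # Otherwise fall back to a kernel lookup (not modelled here).
    raise NotImplementedError(f"Unsupported cast to {to_type.kind}")


int32 = DataType("int32")
string = DataType("string")

# map<string, list<x: int32>> -> map<string, list<item: int32>>
named = DataType("map", (
    Field("key", string),
    Field("x", DataType("list", (Field("x", int32),))),
))
plain = DataType("map", (
    Field("key", string),
    Field("value", DataType("list", (Field("item", int32),))),
))

result = cast_with_fast_path(named, plain)
print(result == plain)
```

Because the check recurses through child fields, it would cover renames at 
arbitrary nesting depth without needing a dedicated kernel for each nested 
type.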

> [C++] Support casting field names of list and map when nested
> -------------------------------------------------------------
>
>                 Key: ARROW-17349
>                 URL: https://issues.apache.org/jira/browse/ARROW-17349
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>    Affects Versions: 9.0.0
>            Reporter: Will Jones
>            Assignee: Will Jones
>            Priority: Major
>              Labels: good-first-issue, kernel, pull-request-available
>             Fix For: 10.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Different parquet implementations use different field names for internal 
> fields of ListType and MapType, which can sometimes cause silly conflicts. 
> For example, we use {{item}} as the field name for list, but Spark uses 
> {{element}}. Fortunately, we can automatically cast between List and Map 
> Types with different field names. Unfortunately, it only works at the top 
> level. We should get it to work at arbitrary levels of nesting.
> This was discovered in delta-rs: 
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
>     table = pa.table({"arr": in_arr})
>     pq.write_table(table, "test.parquet")
>     schema = pa.schema({"arr": out_type})
>     ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
>     pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "<stdin>", line 1, in <module>
> #   File "<stdin>", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
