[
https://issues.apache.org/jira/browse/ARROW-17349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Will Jones reassigned ARROW-17349:
----------------------------------
Assignee: Will Jones
> [C++] Support casting field names of list and map when nested
> -------------------------------------------------------------
>
> Key: ARROW-17349
> URL: https://issues.apache.org/jira/browse/ARROW-17349
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Affects Versions: 9.0.0
> Reporter: Will Jones
> Assignee: Will Jones
> Priority: Major
> Labels: good-first-issue, kernel, pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Different Parquet implementations use different names for the internal
> fields of ListType and MapType, which can cause spurious schema conflicts.
> For example, we use {{item}} as the field name for list, but Spark uses
> {{element}}. Fortunately, we can automatically cast between list and map
> types that differ only in these field names. Unfortunately, that cast only
> works at the top level. We should get it to work at arbitrary levels of
> nesting.
> This was discovered in delta-rs:
> https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285
> Here's a reproduction in Python:
> {code:Python}
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> def roundtrip_scanner(in_arr, out_type):
>     table = pa.table({"arr": in_arr})
>     pq.write_table(table, "test.parquet")
>     schema = pa.schema({"arr": out_type})
>     ds.dataset("test.parquet", schema=schema).to_table()
> # MapType
> ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
> ty = pa.map_(pa.int32(), pa.int32())
> arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # ListType
> ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
> ty = pa.list_(pa.int32())
> arr_named = pa.array([[1, 2, 4]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Combination MapType and ListType
> ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x",
> pa.int32(), nullable=True)), nullable=False))
> ty = pa.map_(pa.string(), pa.list_(pa.int32()))
> arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
> roundtrip_scanner(arr_named, ty)
> # Traceback (most recent call last):
> #   File "<stdin>", line 1, in <module>
> #   File "<stdin>", line 5, in roundtrip_scanner
> #   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
> #   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
> #   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
> #   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
> # pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, list<item: int32>> from map<string, list<x: int32> ('arr')>
> {code}
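
The request for casts to work "at arbitrary levels of nesting" amounts to a recursive comparison that ignores list and map field names. The following is a minimal pure-Python sketch of that idea, not Arrow's C++ implementation. The classes Primitive, ListOf and MapOf and the function same_shape are hypothetical stand-ins for Arrow's types, and the sketch ignores nullability.

```python
from dataclasses import dataclass
from typing import Union


# Hypothetical stand-ins for Arrow data types (not the pyarrow API).
@dataclass(frozen=True)
class Primitive:
    name: str  # e.g. "int32", "string"


@dataclass(frozen=True)
class ListOf:
    field_name: str  # "item" in Arrow, "element" in Spark
    value: "DataType"


@dataclass(frozen=True)
class MapOf:
    key_name: str
    key: "DataType"
    item_name: str
    item: "DataType"


DataType = Union[Primitive, ListOf, MapOf]


def same_shape(a: DataType, b: DataType) -> bool:
    """Return True if a and b differ at most in list/map field names."""
    if isinstance(a, Primitive) and isinstance(b, Primitive):
        return a.name == b.name
    if isinstance(a, ListOf) and isinstance(b, ListOf):
        # Skip field_name and recurse into the value type.
        return same_shape(a.value, b.value)
    if isinstance(a, MapOf) and isinstance(b, MapOf):
        # Skip key_name/item_name and recurse into both children.
        return same_shape(a.key, b.key) and same_shape(a.item, b.item)
    return False


# map<string, list<x: int32>> versus map<string, list<item: int32>>
named = MapOf("key", Primitive("string"), "x", ListOf("x", Primitive("int32")))
default = MapOf("key", Primitive("string"), "value", ListOf("item", Primitive("int32")))
```

Here same_shape(named, default) is true even though named == default is false, which mirrors the third case in the reproduction: the two types differ only in a field name one level down.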