Will Jones created ARROW-17349:
----------------------------------

             Summary: [C++] Support casting field names of list and map when 
nested
                 Key: ARROW-17349
                 URL: https://issues.apache.org/jira/browse/ARROW-17349
             Project: Apache Arrow
          Issue Type: Improvement
          Components: C++
    Affects Versions: 9.0.0
            Reporter: Will Jones
             Fix For: 10.0.0


Different parquet implementations use different field names for internal fields 
of ListType and MapType, which can sometimes cause silly conflicts. For 
example, we use {{item}} as the field name for list, but Spark uses 
{{element}}. Fortunately, we can automatically cast between List and Map Types 
with different field names. Unfortunately, it only works at the top level. We 
should get it to work at arbitrary levels of nesting.

This was discovered in delta-rs: 
https://github.com/delta-io/delta-rs/pull/684#discussion_r935099285

Here's a reproduction in Python:

{code:Python}
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

def roundtrip_scanner(in_arr, out_type):
    table = pa.table({"arr": in_arr})
    pq.write_table(table, "test.parquet")
    schema = pa.schema({"arr": out_type})
    ds.dataset("test.parquet", schema=schema).to_table()

# MapType
ty_named = pa.map_(pa.field("x", pa.int32(), nullable=False), pa.int32())
ty = pa.map_(pa.int32(), pa.int32())
arr_named = pa.array([[(1, 2), (2, 4)]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# ListType
ty_named = pa.list_(pa.field("x", pa.int32(), nullable=False))
ty = pa.list_(pa.int32())
arr_named = pa.array([[1, 2, 4]], type=ty_named)
roundtrip_scanner(arr_named, ty)

# Combination MapType and ListType
ty_named = pa.map_(pa.string(), pa.field("x", pa.list_(pa.field("x", 
pa.int32(), nullable=True)), nullable=False))
ty = pa.map_(pa.string(), pa.list_(pa.int32()))
arr_named = pa.array([[("string", [1, 2, 3])]], type=ty_named)
roundtrip_scanner(arr_named, ty)
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "<stdin>", line 5, in roundtrip_scanner
#   File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
#   File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
#   File "pyarrow/error.pxi", line 144, in 
pyarrow.lib.pyarrow_internal_check_status
#   File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
# pyarrow.lib.ArrowNotImplementedError: Unsupported cast to map<string, 
list<item: int32>> from map<string, list<x: int32> ('arr')>
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to