[ 
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490148#comment-17490148
 ] 

&res commented on ARROW-15643:
------------------------------

Thanks for raising the issue.

I've noticed that you can't cast a struct array to a struct type that contains 
only a subset of its fields. For example:
{code:python}
import pyarrow as pa


struct_type = pa.struct(
    [pa.field("field1", pa.string()), pa.field("field2", pa.int32())]
)

sub_struct_type = pa.struct(
    [
        pa.field("field1", pa.string()),
    ]
)


struct_array = pa.array(
    [
        ("ABC", 123),
        ("EFG", 456),
    ],
    struct_type,
)

struct_array.cast(sub_struct_type)

{code}

Gives you:
{code}

    return call_function("cast", [arr], options)
  File "pyarrow/_compute.pyx", line 527, in pyarrow._compute.call_function
  File "pyarrow/_compute.pyx", line 337, in pyarrow._compute.Function.call
  File "pyarrow/error.pxi", line 143, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 120, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from struct<field1: string, field2: int32> to struct using function cast_struct

{code}

So one option would be to support this type of cast.

> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
>                 Key: ARROW-15643
>                 URL: https://issues.apache.org/jira/browse/ARROW-15643
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>
> Triggered by 
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
> I thought there was already an issue about this, but I couldn't find one.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
>   1,
>   2,
>   3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in 
> a new StructArray, that's not possible with {{struct_field}}, and doing this 
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow 
> selecting multiple fields (although that probably gets ambiguous/confusing 
> with how you currently select a recursively nested field: [0, 1] currently 
> means "first child, second subchild" and not "first and second child"). 
> Another option would be a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc 
> [~westonpace])



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
