[ 
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492687#comment-17492687
 ] 

Joris Van den Bossche commented on ARROW-15643:
-----------------------------------------------

ARROW-1888 has been merged now, and currently the cast is "strict", meaning 
that it requires the exact same number of fields with the same names in the 
same order. If we want to support this issue through a cast, this could be 
relaxed to:

- allowing the fields of the target type to be a subset of the existing fields 
(open question: should field names that are not present in the original array 
be rejected, or also allowed, in which case that field gets filled with nulls?)
- also allowing them to be in a different order?

One thing I am wondering, though, is whether we should consider this a "safe" 
cast, or whether we should add a new flag to CastOptions to allow changing the 
fields of a struct.
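
To make the two relaxations above concrete, here is a minimal pure-Python 
sketch of the field matching such a cast would need. The function name 
plan_struct_cast and the allow_missing flag are hypothetical and only 
illustrate the idea; they are not part of Arrow's CastOptions. The sketch 
also assumes field names are unique within the struct.

```python
def plan_struct_cast(source_names, target_names, allow_missing=False):
    """Map each target field to a source field index (hypothetical helper).

    Returns one entry per target field: the index of the matching source
    field, or None if the field is absent and should be filled with nulls.
    Assumes field names are unique within the source struct.
    """
    index_of = {name: i for i, name in enumerate(source_names)}
    plan = []
    for name in target_names:
        if name in index_of:
            # Subset and reordering: look the field up by name, not position.
            plan.append(index_of[name])
        elif allow_missing:
            # Relaxed variant: an unknown field becomes an all-null child.
            plan.append(None)
        else:
            raise ValueError(f"field {name!r} not present in source struct")
    return plan


print(plan_struct_cast(["a", "b", "c"], ["c", "a"]))
print(plan_struct_cast(["a", "b", "c"], ["a", "d"], allow_missing=True))
```

With allow_missing left at False, a target field that does not exist in the 
source raises an error, which corresponds to the stricter of the two options 
in the first bullet.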

> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
>                 Key: ARROW-15643
>                 URL: https://issues.apache.org/jira/browse/ARROW-15643
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: kernel
>
> Triggered by 
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure
> I thought there was already an issue about this, but I can't find one right away.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
>   1,
>   2,
>   3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in 
> a new StructArray, that's not possible with {{struct_field}}, and doing this 
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow 
> selecting multiple fields (although that probably gets ambiguous/confusing 
> with how you currently select a recursively nested field -> [0, 1] currently 
> means "first child, second subchild" and not "first and second child"). 
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc 
> [~westonpace])



