[
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17490155#comment-17490155
]
Joris Van den Bossche commented on ARROW-15643:
-----------------------------------------------
In general we haven't yet implemented much casting support for structs (e.g.
ARROW-1888, to cast the fields to a different type, is currently being
worked on, but I _suppose_ that PR currently allows the cast only for the same
field names and number of fields, i.e. only changing the type of the fields).
But indeed, that would also be a way to support this functionality. I think
such a cast would be useful to allow in any case. But I also might not directly
think of casting if I am looking to do a field selection (e.g. Table
and RecordBatch have a {{select}} method, and RecordBatch and StructArray are
quite similar, so we could also have a StructArray.select method).
> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
> Key: ARROW-15643
> URL: https://issues.apache.org/jira/browse/ARROW-15643
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
>
> Triggered by
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
> I thought there was already an issue about this, but I can't find one right away.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
> 1,
> 2,
> 3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in
> a new StructArray, that's not possible with {{struct_field}}, and doing this
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow
> selecting multiple fields (although that probably gets ambiguous/confusing
> with how you currently select a recursively nested field -> [0, 1] currently
> means "first child, second subchild" and not "first and second child").
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc
> [~westonpace])
--
This message was sent by Atlassian Jira
(v8.20.1#820001)