[
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17512359#comment-17512359
]
David Li commented on ARROW-15643:
----------------------------------
I believe this should be a cast. For one, that means it will automatically make
scanning better!
We can tackle the unambiguous cases first, and work on the ambiguous cases
later. For instance, subsetting fields without changing their order should be
reasonable. Later we can add an option field to also allow reordering, and to
handle the various ambiguous cases that Will raised.
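To make the "unambiguous case" concrete, here is a minimal sketch in plain Python of a check for whether a target field list is an order-preserving subset of the source fields. The helper name `is_ordered_subset` is hypothetical, and treating duplicate source names as ambiguous is my assumption; this is not how Arrow's cast kernel is implemented.

```python
def is_ordered_subset(source_names, target_names):
    """Return True if target_names can be obtained from source_names
    by only dropping fields (no reordering, no new names)."""
    # Assumption: duplicate field names in the source are treated as ambiguous.
    if len(set(source_names)) != len(source_names):
        return False
    remaining = iter(source_names)
    # `name in remaining` consumes the iterator up to the match, so every
    # target name has to appear after the previously matched one.
    return all(name in remaining for name in target_names)


print(is_ordered_subset(["a", "b", "c"], ["a", "c"]))  # dropping "b" only
print(is_ordered_subset(["a", "b", "c"], ["c", "a"]))  # reordering
```

Under this sketch, dropping fields passes the check, while reordering, unknown names, and duplicates are left for the later, ambiguous-case work.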
IMO, "safe" isn't about copying (all kernels copy, basically, though it would
be good to optimize out copies for the struct fields if there's no type
conversion), but is about whether the cast may produce invalid data or not, and
whether the kernel should error or not. That isn't a concern here; it'll be
passed down to the casts for the child fields.
> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
> Key: ARROW-15643
> URL: https://issues.apache.org/jira/browse/ARROW-15643
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: kernel
>
> Triggered by
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
> I thought there was already an issue about this, but I can't find one right away.
> Assume you have a struct array with some fields:
> {code}
> >>> import pyarrow as pa
> >>> import pyarrow.compute as pc
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
> 1,
> 2,
> 3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in
> a new StructArray, that's not possible with {{struct_field}}, and doing this
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow
> selecting multiple fields (although that probably gets ambiguous/confusing
> with how you currently select a recursively nested field -> [0, 1] currently
> means "first child, second subchild" and not "first and second child").
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc
> [~westonpace])