[
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492843#comment-17492843
]
Weston Pace commented on ARROW-15643:
-------------------------------------
We have a concrete use case for the "subset" version in scanning. Users can
specify nested refs which can be satisfied in the parquet reader but not the
CSV reader. So for the CSV case we need to be able to read in the full column
and then cast down to the targeted struct the user is asking for in the nested
ref.
I don't know about reordering but it might be needed for Substrait to support
their emit property which I think can arbitrarily reorder columns, both at the
batch level and any nested level in a struct.
I'm not sure what the rationale is for the "safe" flag. Are you saying it
might be nice for users to say "do this cast if it can be done zero-copy but
fail otherwise"?
> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
> Key: ARROW-15643
> URL: https://issues.apache.org/jira/browse/ARROW-15643
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Joris Van den Bossche
> Priority: Major
> Labels: kernel
>
> Triggered by
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
> I thought there was already an issue about this, but I can't find one directly.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
> 1,
> 2,
> 3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in
> a new StructArray, that's not possible with {{struct_field}}, and doing this
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow
> selecting multiple fields (although that probably gets ambiguous/confusing
> with how you currently select a recursively nested field -> [0, 1] currently
> means "first child, second subchild" and not "first and second child").
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc
> [~westonpace])