[ 
https://issues.apache.org/jira/browse/ARROW-15643?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17492843#comment-17492843
 ] 

Weston Pace commented on ARROW-15643:
-------------------------------------

We have a concrete use case for the "subset" version in scanning.  Users can 
specify nested refs, which the Parquet reader can satisfy directly but the 
CSV reader cannot.  So for the CSV case we need to read in the full column 
and then cast it down to the targeted struct that the user is asking for in 
the nested ref.

I don't know about reordering, but it might be needed for Substrait to support 
its emit property, which I think can arbitrarily reorder columns, both at the 
batch level and at any nested level in a struct.

I'm not sure what the rationale is for the "safe" flag.  Are you saying it 
might be nice for users to say "do this cast if it can be done zero-copy but 
fail otherwise"?

> [C++] Kernel to select subset of fields of a StructArray
> --------------------------------------------------------
>
>                 Key: ARROW-15643
>                 URL: https://issues.apache.org/jira/browse/ARROW-15643
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Joris Van den Bossche
>            Priority: Major
>              Labels: kernel
>
> Triggered by 
> https://stackoverflow.com/questions/71035754/pyarrow-drop-a-column-in-a-nested-structure.
>  I thought there was already an issue about this, but I can't find one right away.
> Assume you have a struct array with some fields:
> {code}
> >>> arr = pa.StructArray.from_arrays([[1, 2, 3]]*3, names=['a', 'b', 'c'])
> >>> arr.type
> StructType(struct<a: int64, b: int64, c: int64>)
> {code}
> We have a kernel to select a single child field:
> {code}
> >>> pc.struct_field(arr, [0])
> <pyarrow.lib.Int64Array object at 0x7ffa9e229940>
> [
>   1,
>   2,
>   3
> ]
> {code}
> But if you want to subset the StructArray to some of its fields, resulting in 
> a new StructArray, that's not possible with {{struct_field}}, and doing this 
> manually is a bit cumbersome:
> {code}
> >>> fields = ['a', 'c']
> >>> arrays = [arr.field(n) for n in fields]
> >>> arr_subset = pa.StructArray.from_arrays(arrays, names=fields)
> >>> arr_subset.type
> StructType(struct<a: int64, c: int64>)
> {code}
> (this is still OK, but if you had a ChunkedArray, it certainly gets annoying)
> One option could be to expand the existing {{struct_field}} to allow 
> selecting multiple fields (although that probably gets ambiguous/confusing 
> with how you currently select a recursively nested field -> [0, 1] currently 
> means "first child, second subchild" and not "first and second child"). 
> Or a new kernel like "struct_subset" or some other name.
> This might also overlap with general projection functionality? (cc 
> [~westonpace])



