Re: [I] [C++][Compute] Add ExecNode(s) for vectorized take with computed indices [arrow]

via GitHub Wed, 20 Dec 2023 05:48:25 -0800


jorisvandenbossche commented on issue #39311:
URL: https://github.com/apache/arrow/issues/39311#issuecomment-1864504639


   For me there is a bit of a disconnect between the issue title and description. 
The title mentions a "take" kernel, but "take" is an existing vector kernel (to 
take values from a table/array based on pre-computed indices). I don't think you 
can use the "take" kernel at the moment in an Acero plan context (there is a 
FilterNode that does a take under the hood, but that's selecting rows based on 
a predicate expression, not based on pre-computed indices). But I assume this 
is not the kind of "take" you mean here?
   
   My understanding is that you rather want some scalar take-like kernel that 
works on a list of elements and takes certain values from that list? So 
essentially like the existing "list_element(lists, index)" kernel, but then 
allowing multiple indices instead of only a single index?
   
   The "list_element" kernel almost works in the case where you only want to 
select one element of each list:
   
   ```python
   import pyarrow as pa
   from pyarrow.acero import (
       Declaration, TableSourceNodeOptions, ProjectNodeOptions
   )
   import pyarrow.compute as pc
   
   # 'a' holds one list per row, 'b' holds the per-row index into that list
   table = pa.table({'a': [[1, 2, 3], [1, 2, 3], [1, 2, 3]], 'b': [0, 2, 1]})
   
   decl = Declaration.from_sequence([
       Declaration("table_source", TableSourceNodeOptions(table)),
       Declaration("project", ProjectNodeOptions(
           [pc.list_element(pc.field("a"), pc.field("b"))]
       ))
   ])
   # currently fails: "list_element" has no kernel for an array-valued index
   decl.to_table()
   ```
   
   The only reason this doesn't work is that the "list_element" kernel isn't 
yet implemented for the variant where the second argument (the index) is an 
array (one index per row) instead of a scalar (i.e. a constant index for all 
rows). 
   You could have something similar for a potential "list_take" kernel.
   
   This does assume you first have computed those lists and indices. So that 
means that you always first materialize the full list in a groupby step, and 
only in a second step then take a subset of the list. I am not sure there would 
be a way to integrate that in a single groupby call (without using a custom UDF 
for the groupby aggregation).
   
   


