Re: [I] Optimize take/filter from multiple input arrays to a single large output array [arrow-rs]

via GitHub Tue, 08 Apr 2025 06:36:07 -0700


alamb commented on issue #6692:
URL: https://github.com/apache/arrow-rs/issues/6692#issuecomment-2786466512


   > Perhaps it might be worth thinking about what use-cases we're trying to 
improve with this effort, this will ensure we design something that adequately 
addresses that use-case?
   
   The core use case in my mind is to save a copy (and the associated allocation 
overhead, etc.) when building up an output array from subsets of multiple input 
arrays.
   
   Today this operation requires a two-step process with two kernels:
   1. `filter` or `take` to produce an intermediate array for each input
   2. `concat` to copy the intermediates into the output
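
   To make the extra copy concrete, here is a minimal sketch of the two-step shape. It uses plain `Vec<String>` batches rather than Arrow arrays, and the function names (`filter_batch`, `concat_batches`, `filter_then_concat`) are hypothetical stand-ins, not the arrow-rs kernel signatures.

```rust
/// Step 1 (stand-in for `filter`): copy the selected values of one batch
/// into a freshly allocated intermediate buffer.
fn filter_batch(batch: &[String], mask: &[bool]) -> Vec<String> {
    batch
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(value, _)| value.clone()) // first copy of the string data
        .collect()
}

/// Step 2 (stand-in for `concat`): copy every intermediate buffer into
/// the final output.
fn concat_batches(parts: &[Vec<String>]) -> Vec<String> {
    let total: usize = parts.iter().map(Vec::len).sum();
    let mut out = Vec::with_capacity(total);
    for part in parts {
        out.extend(part.iter().cloned()); // second copy of the string data
    }
    out
}

/// The two-step process: one intermediate allocation per input batch,
/// and each selected value is copied twice.
fn filter_then_concat(batches: &[(Vec<String>, Vec<bool>)]) -> Vec<String> {
    let intermediates: Vec<Vec<String>> = batches
        .iter()
        .map(|(batch, mask)| filter_batch(batch, mask))
        .collect();
    concat_batches(&intermediates)
}
```

   In this model every selected string is cloned once into an intermediate and once more into the output, which is the cost the issue is about.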
   
   In certain queries in DataFusion this copying shows up in profiles. For 
example, it appears in queries with relatively unselective filters that involve 
Strings, [such as these clickbench 
queries](https://github.com/apache/datafusion/blob/784df33f8930f91eada0d67aa5acc25a4c25cea2/benchmarks/queries/clickbench/queries.sql#L25-L28)
 where the predicate `SearchPhrase <> ''` passes the long strings through.
   
   Example
   ```sql
   SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY 
"EventTime" LIMIT 10;
   ```
   
   The theory is that by eliminating the intermediate copy and building the 
desired output array directly, we will improve performance.
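
   As a rough illustration of what "building the output directly" could look like, here is a sketch over plain `Vec<String>` batches. The names `filter_into` and `filter_direct` are hypothetical and this is not a proposed arrow-rs API. It only shows the shape of the idea: one output allocation, and selected values written straight into it.

```rust
/// Append the selected values of one batch straight into the shared output.
/// No per-batch intermediate buffer is allocated.
fn filter_into(out: &mut Vec<String>, batch: &[String], mask: &[bool]) {
    for (value, keep) in batch.iter().zip(mask) {
        if *keep {
            out.push(value.clone()); // the only copy of the string data
        }
    }
}

/// Build the output from many (batch, mask) pairs in a single pass,
/// sizing the output once from the number of selected rows.
fn filter_direct(batches: &[(Vec<String>, Vec<bool>)]) -> Vec<String> {
    let selected: usize = batches
        .iter()
        .map(|(_, mask)| mask.iter().filter(|keep| **keep).count())
        .sum();
    let mut out = Vec::with_capacity(selected);
    for (batch, mask) in batches {
        filter_into(&mut out, batch, mask);
    }
    out
}
```

   Whether this translates into a measurable win for real Arrow string arrays is exactly the theory that would need benchmarking.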
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
