alamb commented on issue #6692: URL: https://github.com/apache/arrow-rs/issues/6692#issuecomment-2786466512
> Perhaps it might be worth thinking about what use-cases we're trying to improve with this effort, this will ensure we design something that adequately addresses that use-case?

The core use case in my mind is to save a copy (and the associated allocation overhead, etc.) when building up an output array from subsets of multiple input arrays.

Today this operation requires a two-step process with two kernels:
1. `filter` or `take` --> intermediate
2. `concat` the intermediates to form the output

In certain queries in DataFusion this copying shows up in profiles. For example, it appears in queries with relatively unselective filters that involve Strings, [such as these clickbench queries](https://github.com/apache/datafusion/blob/784df33f8930f91eada0d67aa5acc25a4c25cea2/benchmarks/queries/clickbench/queries.sql#L25-L28) where the predicate `SearchPhrase <> ''` passes the long strings through.

Example

```sql
SELECT "SearchPhrase" FROM hits WHERE "SearchPhrase" <> '' ORDER BY "EventTime" LIMIT 10;
```

The theory is that by eliminating the intermediate copy and building the desired output array directly, we will improve performance.
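To make the two-step process and the proposed alternative concrete, here is a minimal sketch in plain Rust. It deliberately uses `Vec<String>` as a stand-in for string arrays and does not call the arrow-rs kernels; the function names (`filter_batch`, `concat_batches`, `filter_then_concat`, `filter_into_output`) are hypothetical and exist only for illustration. It assumes each input batch comes with a boolean mask of the same length.

```rust
// Stand-in model: a "batch" is a list of string values plus a boolean mask.
// These helpers are hypothetical and are NOT the arrow-rs API.

/// Step 1 of the two-step approach: copy the selected values of one batch
/// into a freshly allocated intermediate.
fn filter_batch(values: &[&str], mask: &[bool]) -> Vec<String> {
    values
        .iter()
        .zip(mask)
        .filter(|(_, keep)| **keep)
        .map(|(v, _)| v.to_string())
        .collect()
}

/// Step 2 of the two-step approach: copy every intermediate again into the
/// final output.
fn concat_batches(batches: &[Vec<String>]) -> Vec<String> {
    let mut out = Vec::with_capacity(batches.iter().map(Vec::len).sum());
    for batch in batches {
        out.extend(batch.iter().cloned());
    }
    out
}

/// Two-step version: each selected string is copied twice.
fn filter_then_concat(inputs: &[(Vec<&str>, Vec<bool>)]) -> Vec<String> {
    let intermediates: Vec<Vec<String>> = inputs
        .iter()
        .map(|(values, mask)| filter_batch(values, mask))
        .collect();
    concat_batches(&intermediates)
}

/// Direct version: selected values are appended straight to the output,
/// so each selected string is copied once and no intermediate is allocated.
fn filter_into_output(inputs: &[(Vec<&str>, Vec<bool>)]) -> Vec<String> {
    let mut out = Vec::new();
    for (values, mask) in inputs {
        for (v, keep) in values.iter().zip(mask) {
            if *keep {
                out.push(v.to_string());
            }
        }
    }
    out
}

fn main() {
    let inputs = vec![
        (vec!["a", "", "b"], vec![true, false, true]),
        (vec!["", "c"], vec![false, true]),
    ];
    // Both strategies must produce the same output.
    assert_eq!(filter_then_concat(&inputs), filter_into_output(&inputs));
}
```

Both functions return the same values; the difference is only in how many times the selected strings are copied, which is the cost that matters when a filter passes most long strings through.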
