[
https://issues.apache.org/jira/browse/ARROW-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Weston Pace updated ARROW-15519:
--------------------------------
Description:
Right now some early runs with Arrowbench and the OT PR
(https://github.com/apache/arrow/pull/12100) show that we spend a fair amount
of time in TPC-H queries on filter nodes. There are a few improvements we know
could be made to our filtering approach at the moment. I'm creating this
parent issue to help categorize and track those:
* -We can use a selection vector in our filters to reduce the amount of
materialization needed. While long term we may want to support a selection
vector throughout the exec plan a good start would be to use it when we
encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x
> 5 && y < 20)-
* If a filter is very selective then we may end up outputting a lot of very
small batches. We could probably hold onto the data at the filter node until
we've accumulated enough rows for a decent sized batch.
* The filter node is currently creating new thread tasks instead of appending
its work onto an existing thread task.
* If we have a chain of filters we could potentially use runtime selectivity
statistics / estimates to reorder our filters so that the most selective
filters are evaluated first.
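
To illustrate the small-batch point above, here is a minimal sketch of holding
filtered rows until a decent-sized batch has accumulated. It does not use any
Arrow API: Batch, FilterLessThan and BatchCoalescer are hypothetical stand-ins
invented for this example (a batch is modeled as a single int column), and the
min_rows threshold is an assumed tuning knob, not an existing option.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical stand-in for a record batch: a single int column.
using Batch = std::vector<int>;

// Hypothetical filter: keeps the rows whose value is below `bound`.
Batch FilterLessThan(const Batch& in, int bound) {
  Batch out;
  for (int v : in) {
    if (v < bound) out.push_back(v);
  }
  return out;
}

// Hypothetical helper (not part of Arrow): buffers filtered rows and only
// releases them once at least `min_rows` rows have accumulated.
class BatchCoalescer {
 public:
  explicit BatchCoalescer(std::size_t min_rows) : min_rows_(min_rows) {
    assert(min_rows > 0);
  }

  // Appends the rows that survived the filter. Returns true and moves the
  // buffered rows into *out once the threshold is reached; otherwise keeps
  // buffering, leaves *out untouched and returns false.
  bool Push(const Batch& filtered, Batch* out) {
    pending_.insert(pending_.end(), filtered.begin(), filtered.end());
    if (pending_.size() < min_rows_) return false;
    *out = std::move(pending_);
    pending_.clear();  // leave the moved-from buffer in a known empty state
    return true;
  }

  // Returns whatever is still buffered; call once at end of input.
  Batch Finish() {
    Batch out = std::move(pending_);
    pending_.clear();
    return out;
  }

 private:
  std::size_t min_rows_;
  Batch pending_;
};
```

A filter node following this idea would call Push for each filtered batch,
forward a batch downstream only when Push returns true, and call Finish when
its input is exhausted. The trade-off is an extra copy and some added latency
in exchange for fewer tiny batches downstream.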
was:
Right now some early runs with Arrowbench and the OT PR
(https://github.com/apache/arrow/pull/12100) show that we spend a fair amount
of time in TPC-H queries on filter nodes. There are a few improvements we know
could be made to our filtering approach at the moment. I'm creating this
parent issue to help categorize and track those:
* We can use a selection vector in our filters to reduce the amount of
materialization needed. While long term we may want to support a selection
vector throughout the exec plan a good start would be to use it when we
encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x
> 5 && y < 20)
* If a filter is very selective then we may end up outputting a lot of very
small batches. We could probably hold onto the data at the filter node until
we've accumulated enough rows for a decent sized batch.
* The filter node is currently creating new thread tasks instead of appending
its work onto an existing thread task.
* If we have a chain of filters we could potentially use runtime selectivity
statistics / estimates to reorder our filters so that the most selective
filters are evaluated first.
> [C++] Investigate potential performance improvements for the filter node
> ------------------------------------------------------------------------
>
> Key: ARROW-15519
> URL: https://issues.apache.org/jira/browse/ARROW-15519
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> Right now some early runs with Arrowbench and the OT PR
> (https://github.com/apache/arrow/pull/12100) show that we spend a fair
> amount of time in TPC-H queries on filter nodes. There are a few
> improvements we know could be made to our filtering approach at the moment.
> I'm creating this parent issue to help categorize and track those:
> * -We can use a selection vector in our filters to reduce the amount of
> materialization needed. While long term we may want to support a selection
> vector throughout the exec plan a good start would be to use it when we
> encounter a chain of filters to avoid excess materialization (e.g. x < 10 &&
> x > 5 && y < 20)-
> * If a filter is very selective then we may end up outputting a lot of very
> small batches. We could probably hold onto the data at the filter node until
> we've accumulated enough rows for a decent sized batch.
> * The filter node is currently creating new thread tasks instead of
> appending its work onto an existing thread task.
> * If we have a chain of filters we could potentially use runtime selectivity
> statistics / estimates to reorder our filters so that the most selective
> filters are evaluated first.