[jira] [Updated] (ARROW-15519) [C++] Investigate potential performance improvements for the filter node

Weston Pace (Jira) Tue, 01 Feb 2022 14:52:06 -0800


     [ 
https://issues.apache.org/jira/browse/ARROW-15519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Weston Pace updated ARROW-15519:
--------------------------------
    Description: 
Right now some early runs with Arrowbench and the OT PR 
(https://github.com/apache/arrow/pull/12100) show that we spend a fair amount 
of time in TPC-H queries on filter nodes.  There are a few improvements we know 
could be made to our filtering approach at the moment.  I'm creating this 
parent issue to help categorize and track those:

 * -We can use a selection vector in our filters to reduce the amount of 
materialization needed.  While long term we may want to support a selection 
vector throughout the exec plan, a good start would be to use it when we 
encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x 
> 5 && y < 20)-
 * If a filter is very selective then we may end up outputting a lot of very 
small batches.  We could probably hold onto the data at the filter node until 
we've accumulated enough rows for a decent-sized batch.
 * The filter node is currently creating new thread tasks instead of appending 
its work onto an existing thread task.
 * If we have a chain of filters we could potentially use runtime selectivity 
statistics / estimates to reorder our filters so that the most selective 
filters are evaluated first.
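
As a rough illustration of the reordering idea in the last item, here is a 
minimal sketch in plain C++.  It does not use any Arrow API; TrackedFilter, 
Evaluate and ReorderBySelectivity are hypothetical names invented for this 
example, and a single int64_t stands in for a row.  Each filter counts how many 
rows it saw and how many it passed, and the chain is periodically re-sorted so 
that the filter with the lowest observed pass rate runs first.

```cpp
#include <algorithm>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical filter wrapper that records runtime selectivity statistics.
struct TrackedFilter {
  std::string name;
  std::function<bool(int64_t)> pred;
  int64_t rows_seen = 0;
  int64_t rows_passed = 0;

  // Fraction of rows that passed; a filter with no data is assumed to pass
  // everything, so it sorts last.
  double PassRate() const {
    return rows_seen == 0 ? 1.0
                          : static_cast<double>(rows_passed) / rows_seen;
  }
};

// Evaluate the chain with short-circuiting and update the statistics.
// Note that a filter only sees rows that passed every earlier filter, so its
// pass rate is conditional on the current order (an estimate, not an exact
// selectivity).
bool Evaluate(std::vector<TrackedFilter>& filters, int64_t value) {
  for (auto& f : filters) {
    ++f.rows_seen;
    if (!f.pred(value)) return false;
    ++f.rows_passed;
  }
  return true;
}

// Put the most selective filter (lowest pass rate) first.  stable_sort keeps
// the existing order for filters with equal pass rates.
void ReorderBySelectivity(std::vector<TrackedFilter>& filters) {
  std::stable_sort(filters.begin(), filters.end(),
                   [](const TrackedFilter& a, const TrackedFilter& b) {
                     return a.PassRate() < b.PassRate();
                   });
}
```

Because the filters are combined with a logical AND, reordering does not change 
which rows pass; it only changes how much work is done before a row is 
rejected.  A real implementation would also need to weigh the cost of each 
filter, which this sketch ignores.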

  was:
Right now some early runs with Arrowbench and the OT PR 
(https://github.com/apache/arrow/pull/12100) show that we spend a fair amount 
of time in TPC-H queries on filter nodes.  There are a few improvements we know 
could be made to our filtering approach at the moment.  I'm creating this 
parent issue to help categorize and track those:

 * We can use a selection vector in our filters to reduce the amount of 
materialization needed.  While long term we may want to support a selection 
vector throughout the exec plan, a good start would be to use it when we 
encounter a chain of filters to avoid excess materialization (e.g. x < 10 && x 
> 5 && y < 20)
 * If a filter is very selective then we may end up outputting a lot of very 
small batches.  We could probably hold onto the data at the filter node until 
we've accumulated enough rows for a decent-sized batch.
 * The filter node is currently creating new thread tasks instead of appending 
its work onto an existing thread task.
 * If we have a chain of filters we could potentially use runtime selectivity 
statistics / estimates to reorder our filters so that the most selective 
filters are evaluated first.
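
To make the selection vector item concrete, here is a minimal sketch in plain 
C++ for the chain x < 10 && x > 5 && y < 20.  It is not Arrow's API: Batch, 
SelectionVector, RowPredicate, ApplyFilter and FilterChain are hypothetical 
names used only for illustration.  Each filter narrows a list of surviving row 
indices, and column data is copied only once, after the last filter.

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical two-column batch; a stand-in, not an Arrow type.
struct Batch {
  std::vector<int64_t> x;
  std::vector<int64_t> y;
};

// Indices of the rows that are still selected.
using SelectionVector = std::vector<int32_t>;

// A predicate that inspects one row of the batch by index.
using RowPredicate = std::function<bool(const Batch&, int32_t)>;

// Narrow the selection in place.  No column data is copied here.
void ApplyFilter(const Batch& batch, const RowPredicate& pred,
                 SelectionVector* sel) {
  std::size_t kept = 0;
  for (std::size_t i = 0; i < sel->size(); ++i) {
    const int32_t row = (*sel)[i];
    if (pred(batch, row)) (*sel)[kept++] = row;
  }
  sel->resize(kept);
}

// Run every filter against the selection vector, then materialize once.
Batch FilterChain(const Batch& batch, const std::vector<RowPredicate>& preds) {
  SelectionVector sel(batch.x.size());
  std::iota(sel.begin(), sel.end(), 0);  // start with every row selected
  for (const auto& pred : preds) ApplyFilter(batch, pred, &sel);

  Batch out;
  out.x.reserve(sel.size());
  out.y.reserve(sel.size());
  for (int32_t row : sel) {
    out.x.push_back(batch.x[row]);
    out.y.push_back(batch.y[row]);
  }
  return out;
}
```

Without the selection vector, each of the three filters would produce a new, 
fully materialized batch; here the intermediate results are just index lists.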


> [C++] Investigate potential performance improvements for the filter node
> -----------------------------------------------------------------------
>
>                 Key: ARROW-15519
>                 URL: https://issues.apache.org/jira/browse/ARROW-15519
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Weston Pace
>            Priority: Major
>
> Right now some early runs with Arrowbench and the OT PR 
> (https://github.com/apache/arrow/pull/12100) show that we spend a fair 
> amount of time in TPC-H queries on filter nodes.  There are a few 
> improvements we know could be made to our filtering approach at the moment.  
> I'm creating this parent issue to help categorize and track those:
>  * -We can use a selection vector in our filters to reduce the amount of 
> materialization needed.  While long term we may want to support a selection 
> vector throughout the exec plan, a good start would be to use it when we 
> encounter a chain of filters to avoid excess materialization (e.g. x < 10 && 
> x > 5 && y < 20)-
>  * If a filter is very selective then we may end up outputting a lot of very 
> small batches.  We could probably hold onto the data at the filter node until 
> we've accumulated enough rows for a decent-sized batch.
>  * The filter node is currently creating new thread tasks instead of 
> appending its work onto an existing thread task.
>  * If we have a chain of filters we could potentially use runtime selectivity 
> statistics / estimates to reorder our filters so that the most selective 
> filters are evaluated first.
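
The small-batch item above (holding filtered rows until a decent-sized batch 
has accumulated) could look roughly like the following sketch.  This is plain 
C++ rather than Arrow code; BatchCoalescer, Rows and the min_rows threshold are 
hypothetical and chosen only to show the buffering logic.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Stand-in for a batch of filtered rows; not an Arrow type.
using Rows = std::vector<int64_t>;

// Buffers the output of a selective filter and emits it only once at least
// `min_rows` rows have accumulated.
class BatchCoalescer {
 public:
  explicit BatchCoalescer(std::size_t min_rows) : min_rows_(min_rows) {}

  // Append filtered rows.  Returns true and fills `out` when the buffered
  // rows reach the threshold; otherwise keeps buffering and returns false.
  bool Push(const Rows& filtered, Rows* out) {
    pending_.insert(pending_.end(), filtered.begin(), filtered.end());
    if (pending_.size() < min_rows_) return false;
    Take(out);
    return true;
  }

  // Call once at end of input so a final undersized batch is not lost.
  bool Flush(Rows* out) {
    if (pending_.empty()) return false;
    Take(out);
    return true;
  }

 private:
  void Take(Rows* out) {
    *out = std::move(pending_);
    pending_.clear();  // leave the moved-from buffer in a known empty state
  }

  std::size_t min_rows_;
  Rows pending_;
};
```

The trade-off is an extra copy into the buffer and some added latency before 
downstream nodes see any rows, in exchange for fewer, larger batches.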



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
