[GitHub] [arrow-datafusion] andygrove opened a new issue, #2742: Investigate unexpected performance issue when adding a filter to reduce rows going into a join

GitBox Thu, 16 Jun 2022 17:22:16 -0700


andygrove opened a new issue, #2742:
URL: https://github.com/apache/arrow-datafusion/issues/2742


   **Is your feature request related to a problem or challenge? Please describe 
what you are trying to do.**
   
   I have a query that performs a join and the join key contains null values on 
one side. If I run a plain join I see this performance:
   
   ```
   $ time bdt query --sql "select count(*) from web_sales join customer on 
web_sales.ws_bill_customer_sk = customer.c_customer_sk" --table 
customer.parquet --table web_sales.parquet
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 71983355        |
   +-----------------+
   
   real 0m0.824s
   user 0m20.911s
   sys  0m2.494s
   ```
   
   If I try and help DataFusion out by filtering out the null values before the 
join, the query takes 4x longer:
   
   ```
   $ time bdt query --sql "select count(*) from web_sales join customer on 
web_sales.ws_bill_customer_sk = customer.c_customer_sk where 
web_sales.ws_bill_customer_sk is not null" --table customer.parquet --table 
web_sales.parquet
   +-----------------+
   | COUNT(UInt8(1)) |
   +-----------------+
   | 71983355        |
   +-----------------+
   
   real 0m3.413s
   user 0m52.227s
   sys  0m17.020s
   
   ```
   
   **Describe the solution you'd like**
   I plan on investigating this to understand why it is so slow and what we can 
do about it.
   
   I am guessing that the filter is expensive because it is creating new 
batches and maybe the coalesce step as well. The join  would just ignore the 
nulls and not create new batches so maybe that is just more efficient. Also, in 
this case it should be possible to push the filter down to the table scan but I 
am not sure that we have implemented that yet.
   
   **Describe alternatives you've considered**
   None
   
   **Additional context**
   None
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] andygrove opened a new issue, #2742: Investigate unexpected performance issue when adding a filter to reduce rows going into a join

Reply via email to