andygrove opened a new issue, #2742: URL: https://github.com/apache/arrow-datafusion/issues/2742
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.** I have a query that performs a join and the join key contains null values on one side. If I run a plain join I see this performance: ``` $ time bdt query --sql "select count(*) from web_sales join customer on web_sales.ws_bill_customer_sk = customer.c_customer_sk" --table customer.parquet --table web_sales.parquet +-----------------+ | COUNT(UInt8(1)) | +-----------------+ | 71983355 | +-----------------+ real 0m0.824s user 0m20.911s sys 0m2.494s ``` If I try and help DataFusion out by filtering out the null values before the join, the query takes 4x longer: ``` $ time bdt query --sql "select count(*) from web_sales join customer on web_sales.ws_bill_customer_sk = customer.c_customer_sk where web_sales.ws_bill_customer_sk is not null" --table customer.parquet --table web_sales.parquet +-----------------+ | COUNT(UInt8(1)) | +-----------------+ | 71983355 | +-----------------+ real 0m3.413s user 0m52.227s sys 0m17.020s ``` **Describe the solution you'd like** I plan on investigating this to understand why it is so slow and what we can do about it. I am guessing that the filter is expensive because it is creating new batches and maybe the coalesce step as well. The join would just ignore the nulls and not create new batches so maybe that is just more efficient. Also, in this case it should be possible to push the filter down to the table scan but I am not sure that we have implemented that yet. **Describe alternatives you've considered** None **Additional context** None -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
