simonvandel opened a new issue, #8393: URL: https://github.com/apache/arrow-datafusion/issues/8393
### Describe the bug Joins where the `ON` filter are not equality, but rather inequalities like `<`, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct "competitor". The main difference between the DuckDB and Datafusion plans seem to be that Datafusion uses a `NestedLoopJoinExec`, while DuckDB uses a `IEJoin`. Note that the query could be written better with a ASOF-join, but Datafusion does not support that (see issue https://github.com/apache/arrow-datafusion/issues/318). ### To Reproduce Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB: ```sql CREATE OR REPLACE TABLE pricing AS SELECT t, RANDOM() as v FROM range( '2022-01-01' :: TIMESTAMP, '2023-01-01' :: TIMESTAMP, INTERVAL 30 DAY ) ts(t); COPY pricing to 'pricing.parquet' (format 'parquet'); CREATE OR REPLACE TABLE timestamps AS SELECT t FROM range( '2022-01-01' :: TIMESTAMP, '2023-01-01' :: TIMESTAMP, INTERVAL 10 SECOND ) ts(t); COPY timestamps to 'timestamps.parquet' (format 'parquet'); ``` ```shell $ duckdb < repro-dataset.sql ``` We will compare the performance of the following query in DuckDB and Datafusion. The query is saved as `repro-range-query.sql`. ```sql WITH pricing_state AS ( SELECT t as valid_from, COALESCE( LEAD(t, 1) OVER ( ORDER BY t ), '9999-12-31' ) as valid_to, v FROM 'pricing.parquet' ) SELECT t.t, p.v FROM pricing_state p LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from AND p.valid_to; ``` **DuckDB performance:** ```shell $ time duckdb < repro-range-query.sql ... real 0m0.999s user 0m6.070s sys 0m3.600s ``` **Datafusion performance:** ```shell $ time datafusion-cli -f repro-range-query.sql ... real 0m8.269s user 0m6.358s sys 0m1.907s ``` ### Expected behavior It would be nice if the above query (or something equivalent) would be faster in Datafusion. If someone knows of a better way to express the query, then that could also be a workaround for me. ### Additional context **Machine tested on**: CPU:Ryzen 3900x OS: Ubuntu 22.04 **Versions used:** ```shell $ duckdb --version v0.9.2 3c695d7ba9 ``` ```shell $ datafusion-cli --version datafusion-cli 33.0.0 ``` -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
