[I] Range/inequality joins are slow [arrow-datafusion]

via GitHub Fri, 01 Dec 2023 04:51:10 -0800


simonvandel opened a new issue, #8393:
URL: https://github.com/apache/arrow-datafusion/issues/8393


   ### Describe the bug
   
   Joins where the `ON` filter are not equality, but rather inequalities like 
`<`, `> etc. seem slow. Atleast compared to DuckDB which seem like a direct 
"competitor".
   
   The main difference between the DuckDB and Datafusion plans seem to be that 
Datafusion uses a `NestedLoopJoinExec`, while DuckDB uses a `IEJoin`.
   
   Note that the query could be written better with a ASOF-join, but Datafusion 
does not support that (see issue 
https://github.com/apache/arrow-datafusion/issues/318).
   
   ### To Reproduce
   
   Create some test data with this SQL (saved as repro-dataset.sql) in DuckDB:
   ```sql
   CREATE
   OR REPLACE TABLE pricing AS
   SELECT
       t,
       RANDOM() as v
   FROM
       range(
           '2022-01-01' :: TIMESTAMP,
           '2023-01-01' :: TIMESTAMP,
           INTERVAL 30 DAY
       ) ts(t);
   
   COPY pricing to 'pricing.parquet' (format 'parquet');
   
   CREATE
   OR REPLACE TABLE timestamps AS
   SELECT
       t
   FROM
       range(
           '2022-01-01' :: TIMESTAMP,
           '2023-01-01' :: TIMESTAMP,
           INTERVAL 10 SECOND
       ) ts(t);
   
   COPY timestamps to 'timestamps.parquet' (format 'parquet');
   ```
   
   ```shell
   $ duckdb < repro-dataset.sql
   ```
   
   We will compare the performance of the following query in DuckDB and 
Datafusion. The query is saved as `repro-range-query.sql`.
   
   ```sql
   WITH pricing_state AS (
       SELECT
           t as valid_from,
           COALESCE(
               LEAD(t, 1) OVER (
                   ORDER BY
                       t
               ),
               '9999-12-31'
           ) as valid_to,
           v
       FROM
           'pricing.parquet'
   )
   SELECT
       t.t,
       p.v
   FROM
       pricing_state p
       LEFT JOIN 'timestamps.parquet' t ON t.t BETWEEN p.valid_from
       AND p.valid_to;
   ```
   
   **DuckDB performance:**
   ```shell
   $ time duckdb < repro-range-query.sql
   ...
   real    0m0.999s
   user    0m6.070s
   sys     0m3.600s
   ```
   
   **Datafusion performance:**
   ```shell
   $ time datafusion-cli -f repro-range-query.sql
   ...
   real    0m8.269s
   user    0m6.358s
   sys     0m1.907s
   ```
   
   
   
   
   ### Expected behavior
   
   It would be nice if the above query (or something equivalent) would be 
faster in Datafusion.
   
   If someone knows of a better way to express the query, then that could also 
be a workaround for me.
   
   ### Additional context
   
   **Machine tested on**:
   CPU:Ryzen 3900x
   OS: Ubuntu 22.04
   
   **Versions used:**
   ```shell
   $ duckdb --version
   v0.9.2 3c695d7ba9
   ```
   
   ```shell
   $ datafusion-cli --version
   datafusion-cli 33.0.0
   ```


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[I] Range/inequality joins are slow [arrow-datafusion]

Reply via email to