metesynnada commented on PR #7366:
URL:
https://github.com/apache/arrow-datafusion/pull/7366#issuecomment-1689863030
I created a small streaming benchmark using TPC-H data.
```
Benchmark streaming.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃ apache_main ┃ upstream_prunable-hash-join ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 1     │   1483.13ms │                   1483.27ms │     no change │
│ QQuery 2     │  11033.15ms │                   6903.41ms │ +1.60x faster │
└──────────────┴─────────────┴─────────────────────────────┴───────────────┘
```
The first query is
```sql
SELECT
    o_orderkey
FROM
    orders,
    lineitem
WHERE
    o_orderdate = l_shipdate
    AND l_orderkey >= o_orderkey - 10
    AND l_orderkey < o_orderkey + 10
    AND l_returnflag = 'R'
```
and the second one is
```sql
SELECT
    o_orderkey
FROM
    orders,
    lineitem
WHERE
    o_orderstatus = l_linestatus
    AND l_orderkey >= o_orderkey - 10
    AND l_orderkey < o_orderkey + 10
    AND l_returnflag = 'R'
LIMIT 10000;
```
The second query joins on key pairs with low cardinality, so many rows
share each key. While `smallvec` was effective when allocating new keys,
deleting from it was slow. With the `smallvec` mechanism removed in this PR,
the second query runs about 1.60x faster (11033.15ms down to 6903.41ms),
while the first query is unchanged.
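
To illustrate why deletion hurts with low-cardinality keys, here is a minimal sketch. It is not the code from this PR: a plain `Vec` stands in for `smallvec`, and the names `prune_contiguous` and `prune_front` and the `u8` key type are hypothetical. With only a few distinct keys, each per-key vector grows long, and every pruning pass over a contiguous vector has to scan it and shift the surviving entries.

```rust
use std::collections::{HashMap, VecDeque};

/// Hypothetical stand-in for a join hash table that stores all row indices
/// of a key in one contiguous vector (`Vec` plays the role of `smallvec`).
/// Returns the number of row indices removed.
pub fn prune_contiguous(table: &mut HashMap<u8, Vec<u64>>, prune_before: u64) -> usize {
    let mut removed = 0;
    for rows in table.values_mut() {
        let before = rows.len();
        // `retain` visits every element and shifts survivors left, so each
        // pass costs O(rows.len()) per key, even if few rows are dropped.
        rows.retain(|&row| row >= prune_before);
        removed += before - rows.len();
    }
    removed
}

/// Contrast case (an assumption for illustration only): rows are appended in
/// increasing order, so expired rows sit at the front and can be popped
/// without touching the rest of the queue.
pub fn prune_front(table: &mut HashMap<u8, VecDeque<u64>>, prune_before: u64) -> usize {
    let mut removed = 0;
    for rows in table.values_mut() {
        while rows.front().map_or(false, |&row| row < prune_before) {
            rows.pop_front();
            removed += 1;
        }
    }
    removed
}
```

Both functions remove the same rows. The difference is that the first one does work proportional to the full length of each per-key vector, which is large when a column such as `o_orderstatus` has only a handful of distinct values.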