aditanase opened a new issue, #16106:
URL: https://github.com/apache/datafusion/issues/16106
### Describe the bug
I have a simple use case for a star-schema join between a facts table and 2
metadata tables (one of them small, another one larger but would still make
sense to collect in memory - 200K rows, ~90 MB). Here is a simplified version
of this query:
```sql
SELECT
fm.Timestamp,
SUM(dm.BasePrice * fm.ConsumedUnits) AS BilledUnits
FROM dr
JOIN fm ON fm.ResourceId = dr.ResourceId
JOIN dm ON fm.MeterId = dm.MeterId
WHERE
fm.Timestamp > (now() - interval '10' day) AND
dm.MeterName = 'Usage' AND
dr.Region != ''
GROUP BY fm.Timestamp
```
The join is executed on a string column. All 3 tables are deltalake tables
loaded with `delta-rs`.
My goal is to convince DF to execute the query as described here: ["right
deep
tree"](https://github.com/apache/datafusion/blob/f77579108d1dc0285636fbfb24507d2bfca66446/datafusion/physical-plan/src/joins/hash_join.rs#L276-L312).
After tweaking the config a bit - mainly disabling repartition for joins - I
managed to obtain 2 hash joins with mode `CollectLeft`, but the query will OOM
trying to collect the facts table after the first join, with the 2nd metadata
table remaining on the right.
Drilling in the code, this seems to happen because on the 2nd join the
optimizer will bail since there are no statistics pushed through the Join, and
left and right are not swapped. The facts table remains in the middle of the
tree.
The reason appears to be that delta-rs does not generate columns stats for
strings, [only for "primitive
types"](https://github.com/delta-io/delta-rs/blob/main/crates/core/src/kernel/snapshot/log_data.rs#L620-L639).
I have test a patch that will swap left and right as a last resort if we
already have a join with a CollectLeft in the left node - basically trying to
keep the pshysical shape of a "right deep tree":
https://github.com/hstack/arrow-datafusion/commit/eea1ff4344f2d36e134af8e29472137c10a0a5d4
If you think this is a sensible approach, we may want to add a config flag
to fence this, but otherwise I can raise a PR with this.
I am wondering how you generally feel about other ways to guide the
optimizer:
- enforce min/max for string columns in delta-rs (does not seem like good
value in general, you can't really compute cardinalities and such)
- adding some sorf ot join hints in the SQL planner [like we have in
spark](https://downloads.apache.org/spark/docs/3.0.0/sql-ref-syntax-qry-select-hints.html#join-hints)
- other ideas?
cc @alamb
### To Reproduce
_No response_
### Expected behavior
CollectLeft optimizations to work for more than 2 tables, even if the join
key is string and column stats are missing - I thinks this is a common case for
star schema joins.
### Additional context
_No response_
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]