erratic-pattern opened a new issue, #20437:
URL: https://github.com/apache/datafusion/issues/20437
# Panic in HashJoin with dictionary-encoded column in multi-column join key
## Describe the bug
When executing a hash join with multiple join keys where one column is
dictionary-encoded with fewer unique values than rows, DataFusion panics with:
```
InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected N got M")
```
## To Reproduce
```sql
-- Small table with dictionary-encoded region (2 rows, 1 unique value)
CREATE TABLE small AS
SELECT id, arrow_cast(region, 'Dictionary(Int32, Utf8)') as region
FROM (VALUES (1, 'west'), (2, 'west')) AS t(id, region);
CREATE TABLE large AS
SELECT id, region, value
FROM (VALUES (1, 'west', 100), (2, 'west', 200), (3, 'east', 300)) AS t(id, region, value);
-- Multi-column join triggers panic
SELECT s.id, s.region, l.value
FROM small s
JOIN large l ON s.id = l.id AND s.region = l.region;
```
## Expected behavior
Query returns 2 rows:
```
+----+--------+-------+
| id | region | value |
+----+--------+-------+
| 1  | west   | 100   |
| 2  | west   | 200   |
+----+--------+-------+
```
## Actual behavior
Panic:
```
thread 'main' panicked at arrow-array/src/array/struct_array.rs:91:46:
called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected 3 got 2")
```
## Root cause
In [`flatten_dictionary_array`](https://github.com/apache/datafusion/blob/52.1.0/datafusion/physical-plan/src/joins/hash_join/inlist_builder.rs#L37-L45), introduced by #18393:
```rust
fn flatten_dictionary_array(array: &ArrayRef) -> ArrayRef {
    downcast_dictionary_array! {
        array => {
            flatten_dictionary_array(array.values()) // BUG: returns only unique values
        }
        _ => Arc::clone(array)
    }
}
```
The function calls `array.values()`, which returns the dictionary's unique
values array (length = number of unique values) rather than the expanded array
(length = number of rows).
When building a `StructArray` for multi-column join keys, this causes a
length mismatch between dictionary columns (incorrectly shortened) and
non-dictionary columns.
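
The length mismatch can be illustrated with a small standalone sketch. It does not use the arrow-rs API: `DictColumn`, `values_only`, `expand`, `check_equal_lengths` and `demo_lengths` are hypothetical names made up for this illustration, and the data mirrors the `small` table from the reproduction (2 rows, 1 unique `region`). It only models why taking the dictionary's values yields a shorter column than expanding the keys.

```rust
/// Toy stand-in for a dictionary-encoded column (hypothetical, not the arrow-rs type).
struct DictColumn {
    keys: Vec<usize>,    // one key per row, indexing into `values`
    values: Vec<String>, // unique values only
}

impl DictColumn {
    /// Mirrors the behaviour described above: returns only the unique values.
    fn values_only(&self) -> Vec<String> {
        self.values.clone()
    }

    /// Expands the dictionary to one value per row by looking up each key.
    fn expand(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

/// Simplified model of a struct-style length check: all columns must have the same length.
fn check_equal_lengths(lengths: &[usize]) -> Result<(), String> {
    match lengths.iter().find(|&&len| len != lengths[0]) {
        Some(&got) => Err(format!("expected {} got {}", lengths[0], got)),
        None => Ok(()),
    }
}

/// Returns (row count, length via values_only, length via expand).
fn demo_lengths() -> (usize, usize, usize) {
    let id = vec![1_i64, 2];
    let region = DictColumn {
        keys: vec![0, 0],
        values: vec!["west".to_string()],
    };
    (id.len(), region.values_only().len(), region.expand().len())
}

fn main() {
    let (rows, flattened, expanded) = demo_lengths();
    println!("rows={} values_only={} expanded={}", rows, flattened, expanded);
    // Pairing `id` with the unique values fails the length check ...
    println!("{:?}", check_equal_lengths(&[rows, flattened]));
    // ... while pairing it with the expanded column passes.
    println!("{:?}", check_equal_lengths(&[rows, expanded]));
}
```

In this model, any approach that materializes one value per row (the `expand` path) keeps the dictionary column the same length as the other join key columns. How the real function should do that in arrow-rs is not decided here.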