erratic-pattern opened a new issue, #20437:
URL: https://github.com/apache/datafusion/issues/20437

   # Panic in HashJoin with dictionary-encoded column in multi-column join key
   
   ## Describe the bug
   
   When executing a hash join with multiple join keys where one column is 
dictionary-encoded with fewer unique values than rows, DataFusion panics with:
   
   ```
   InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected N got M")
   ```
   
   ## To Reproduce
   
   ```sql
   -- Small table with dictionary-encoded region (2 rows, 1 unique value)
   CREATE TABLE small AS
   SELECT id, arrow_cast(region, 'Dictionary(Int32, Utf8)') as region
   FROM (VALUES (1, 'west'), (2, 'west')) AS t(id, region);
   
   CREATE TABLE large AS
   SELECT id, region, value
   FROM (VALUES (1, 'west', 100), (2, 'west', 200), (3, 'east', 300)) AS t(id, region, value);
   
   -- Multi-column join triggers panic
   SELECT s.id, s.region, l.value
   FROM small s
   JOIN large l ON s.id = l.id AND s.region = l.region;
   ```
   
   ## Expected behavior
   
   Query returns 2 rows:
   ```
   +----+--------+-------+
   | id | region | value |
   +----+--------+-------+
   | 1  | west   | 100   |
   | 2  | west   | 200   |
   +----+--------+-------+
   ```
   
   ## Actual behavior
   
   Panic:
   ```
   thread 'main' panicked at arrow-array/src/array/struct_array.rs:91:46:
   called `Result::unwrap()` on an `Err` value: InvalidArgumentError("Incorrect array length for StructArray field \"c1\", expected 3 got 2")
   ```
   
   ## Root cause
   
   In [`flatten_dictionary_array`](https://github.com/apache/datafusion/blob/52.1.0/datafusion/physical-plan/src/joins/hash_join/inlist_builder.rs#L37-L45), introduced by #18393:
   
   ```rust
   fn flatten_dictionary_array(array: &ArrayRef) -> ArrayRef {
       downcast_dictionary_array! {
           array => {
               // BUG: returns only the dictionary's unique values, not one value per row
               flatten_dictionary_array(array.values())
           }
           _ => Arc::clone(array)
       }
   }
   ```
   
   The function calls `array.values()`, which returns the dictionary's values 
array (length = number of unique values) rather than the expanded array (length 
= number of rows).
   
   When building a `StructArray` for multi-column join keys, this causes a 
length mismatch between dictionary columns (incorrectly shortened) and 
non-dictionary columns.
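
The length mismatch described above can be illustrated with a minimal, std-only sketch. This is a toy model, not the arrow-rs or DataFusion API: `DictColumn`, `values_only`, `expand`, and `check_same_length` are hypothetical names introduced only for illustration, and the error text merely imitates the shape of the reported message.

```rust
/// Toy stand-in for a dictionary-encoded column (hypothetical, NOT arrow-rs).
struct DictColumn {
    /// One key per row, indexing into `values`.
    keys: Vec<usize>,
    /// The distinct values the keys refer to.
    values: Vec<String>,
}

impl DictColumn {
    /// Mirrors the behaviour described in the report: only the distinct
    /// values are returned, so the length is the number of unique values.
    fn values_only(&self) -> Vec<String> {
        self.values.clone()
    }

    /// Expands the keys into one value per row, so the length matches the
    /// number of rows of the other join-key columns.
    fn expand(&self) -> Vec<String> {
        self.keys.iter().map(|&k| self.values[k].clone()).collect()
    }
}

/// Hypothetical check in the spirit of a struct-of-columns length validation:
/// every field must have the same length as the first one.
fn check_same_length(lens: &[usize]) -> Result<usize, String> {
    let expected = lens.first().copied().unwrap_or(0);
    for (i, &len) in lens.iter().enumerate() {
        if len != expected {
            return Err(format!("field c{i}: expected {expected} got {len}"));
        }
    }
    Ok(expected)
}
```

With a two-row `id` column and a `region` column holding keys `[0, 0]` over the single value `"west"`, `values_only()` has length 1 and the check fails, while `expand()` has length 2 and the check passes. In arrow-rs, a comparable expansion can presumably be obtained by casting the dictionary array to its value type, though the actual fix adopted upstream may differ.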
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

