Rich-T-kid opened a new issue, #23127:
URL: https://github.com/apache/datafusion/issues/23127

   ### Describe the bug
   
   When a `Dictionary(K, V)` column is used as a GROUP BY key, the output 
schema is fixed at plan time with the original key type K. If the number of 
distinct groups seen at runtime exceeds K's representable range, the query 
errors, even though the input data and the query are completely valid.
   
   ## core problem
   DataFusion's execution model locks the output schema at planning time. For 
dictionary columns this means the key type is fixed (e.g. UInt8). There is 
currently no mechanism for the key type to grow during execution to reflect the 
actual observed cardinality. The schema becomes a correctness constraint: a 
query over a `Dict<UInt8, Utf8>` column with 300 distinct values is valid SQL, 
but DataFusion will reject it at runtime purely because the pre-committed key 
type cannot represent 300 distinct indices. 
   
   
   ### To Reproduce
   
   ## Where it fails today
   
   The only current execution path for dictionary group keys is 
`GroupValuesRows`, which materializes values internally and then re-encodes 
back to the original dictionary type on emit. That re-encoding is where the 
key-type overflow surfaces as an opaque Arrow cast error. But the root cause is 
not GroupValuesRows specifically, any future specialised path would face the 
same constraint as long as the output schema's key type is fixed.
   
   ```
   #[test]
       fn dict_uint8_utf8_257_groups_emit() -> Result<()> {
           let schema = Arc::new(Schema::new(vec![Field::new(
               "k",
               DataType::Dictionary(Box::new(DataType::UInt8), 
Box::new(DataType::Utf8)),
               false,
           )]));
   
           let mut gv = GroupValuesRows::try_new(Arc::clone(&schema))?;
   
           let mut groups = vec![];
           for i in 0u32..257 {
               let mut builder = StringDictionaryBuilder::<UInt8Type>::new();
               builder.append_value(format!("group_{i}"));
               let arr: ArrayRef = Arc::new(builder.finish());
               gv.intern(&[arr], &mut groups)?;
           }
   
           assert_eq!(gv.len(), 257);
   
           let result = gv.emit(EmitTo::All);
           match &result {
               Ok(arrays) => {
                   assert_eq!(arrays.len(), 1, "expected one output column");
                   let out = &arrays[0];
                   assert_eq!(
                       out.len(),
                       257,
                       "emitted array should still have 257 rows; got {}",
                       out.len()
                   );
               }
               Err(e) => {
                   println!("emit returned error (expected for overflow): {e}");
               }
           }
   
           Ok(())
       }
   ```
   This produces the following error
   ```
   emit returned error (expected for overflow): Arrow error: Dictionary key 
bigger than the key type
   ```
   
   ### Expected behavior
   
   The output key type should be treated as a minimum width, not a maximum. 
When the observed distinct count exceeds the planned key type's range, the 
engine should promote the key type to the next sufficient width.
   
   ### Additional context
   
   @kumarUjjawal mentioned a similar idea 
[here](https://github.com/apache/datafusion/pull/21765#discussion_r3361313280)
   @tustvold encountered some issues with [dictionary array schemas 
](https://github.com/apache/datafusion/issues/7647)
   @zhuqi-lucas is working on including [nested 
types](https://github.com/apache/datafusion/issues/22715) for 
`GroupValuesColumn` so this work may align.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to