Rich-T-kid opened a new issue, #23127:
URL: https://github.com/apache/datafusion/issues/23127
### Describe the bug
When a `Dictionary(K, V)` column is used as a GROUP BY key, the output
schema is fixed at plan time with the original key type K. If the number of
distinct groups seen at runtime exceeds K's representable range, the query
errors, even though the input data and the query are completely valid.
## core problem
DataFusion's execution model locks the output schema at planning time. For
dictionary columns this means the key type is fixed (e.g. UInt8). There is
currently no mechanism for the key type to grow during execution to reflect the
actual observed cardinality. The schema becomes a correctness constraint: a
query over a `Dict<UInt8, Utf8>` column with 300 distinct values is valid SQL,
but DataFusion will reject it at runtime purely because the pre-committed key
type cannot represent 300 distinct indices.
### To Reproduce
## Where it fails today
The only current execution path for dictionary group keys is
`GroupValuesRows`, which materializes values internally and then re-encodes
back to the original dictionary type on emit. That re-encoding is where the
key-type overflow surfaces as an opaque Arrow cast error. But the root cause is
not GroupValuesRows specifically, any future specialised path would face the
same constraint as long as the output schema's key type is fixed.
```
#[test]
fn dict_uint8_utf8_257_groups_emit() -> Result<()> {
let schema = Arc::new(Schema::new(vec![Field::new(
"k",
DataType::Dictionary(Box::new(DataType::UInt8),
Box::new(DataType::Utf8)),
false,
)]));
let mut gv = GroupValuesRows::try_new(Arc::clone(&schema))?;
let mut groups = vec![];
for i in 0u32..257 {
let mut builder = StringDictionaryBuilder::<UInt8Type>::new();
builder.append_value(format!("group_{i}"));
let arr: ArrayRef = Arc::new(builder.finish());
gv.intern(&[arr], &mut groups)?;
}
assert_eq!(gv.len(), 257);
let result = gv.emit(EmitTo::All);
match &result {
Ok(arrays) => {
assert_eq!(arrays.len(), 1, "expected one output column");
let out = &arrays[0];
assert_eq!(
out.len(),
257,
"emitted array should still have 257 rows; got {}",
out.len()
);
}
Err(e) => {
println!("emit returned error (expected for overflow): {e}");
}
}
Ok(())
}
```
This produces the following error
```
emit returned error (expected for overflow): Arrow error: Dictionary key
bigger than the key type
```
### Expected behavior
The output key type should be treated as a minimum width, not a maximum.
When the observed distinct count exceeds the planned key type's range, the
engine should promote the key type to the next sufficient width.
### Additional context
@kumarUjjawal mentioned a similar idea
[here](https://github.com/apache/datafusion/pull/21765#discussion_r3361313280)
@tustvold encountered some issues with [dictionary array schemas
](https://github.com/apache/datafusion/issues/7647)
@zhuqi-lucas is working on including [nested
types](https://github.com/apache/datafusion/issues/22715) for
`GroupValuesColumn` so this work may align.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]