erratic-pattern opened a new issue, #20935: URL: https://github.com/apache/datafusion/issues/20935
**Is your feature request related to a problem or challenge?** When string scalar functions (e.g. `regexp_replace`, `concat`, `lower`, `upper`, `replace`, `substr`, etc.) receive a `Dictionary(K, Utf8)` input, the current implementation unwraps the Dictionary to plain `Utf8`: 1. The type coercion layer in `datafusion/expr/src/type_coercion/functions.rs` unwraps `Dictionary(K, V)` to `V` before the function sees it 2. Each function's `return_type` also maps `Dictionary(_, Utf8)` → `Utf8` 3. The output is a plain `Utf8` array This causes two problems: **Arrow IPC/Flight message size inflation**: Dictionary-encoded columns use a fixed size per row based on the key type (e.g. 4 bytes for `Int32`) with unique values stored once. Plain `Utf8` stores the full string for every row. For columns with repeated string values, this can significantly inflate individual Arrow IPC messages, potentially exceeding gRPC message size limits (4MB default in tonic). **Performance**: String operations are applied to every row instead of just the unique dictionary values. **Describe the solution you'd like** Make string scalar functions handle Dictionary-typed inputs natively, preserving the encoding: 1. **`return_type`**: Return `Dictionary(K, V')` when input is `Dictionary(K, V)`, where `V'` is the appropriate result value type. 2. **`invoke`/`invoke_with_args`**: When the input is a `DictionaryArray`, apply the string operation only to the dictionary's unique values array, then construct a new `DictionaryArray` with the transformed values and the original keys (index) array. Ideally this would be a reusable pattern (perhaps a helper or wrapper) that individual string functions can opt into, rather than duplicating the logic in every function. **Describe alternatives you've considered** - **Workaround**: Users can wrap results in `arrow_cast(some_string_fn(...), 'Dictionary(Int32, Utf8)')` to re-dictionary-encode the output. However, this is non-obvious and still applies the operation to every row rather than just the unique values. - **Coerce Dictionary to Utf8View instead of Utf8**: The type coercion layer could coerce `Dictionary(K, Utf8)` to `Utf8View` rather than `Utf8`. Since string functions already handle `Utf8View` inputs and return `Utf8View` outputs, this would work without per-function changes. `Utf8View` is more compact than `Utf8` for repeated values in IPC serialization, though not as compact as Dictionary encoding. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
