erratic-pattern opened a new issue, #20935:
URL: https://github.com/apache/datafusion/issues/20935

   **Is your feature request related to a problem or challenge?**
   
   When string scalar functions (e.g. `regexp_replace`, `concat`, `lower`, 
`upper`, `replace`, `substr`, etc.) receive a `Dictionary(K, Utf8)` input, the 
current implementation unwraps the Dictionary to plain `Utf8`:
   
   1. The type coercion layer in 
`datafusion/expr/src/type_coercion/functions.rs` unwraps `Dictionary(K, V)` to 
`V` before the function sees it
   2. Each function's `return_type` also maps `Dictionary(_, Utf8)` → `Utf8`
   3. The output is a plain `Utf8` array
   
   This causes two problems:
   
   **Arrow IPC/Flight message size inflation**: Dictionary-encoded columns use 
a fixed size per row based on the key type (e.g. 4 bytes for `Int32`) with 
unique values stored once. Plain `Utf8` stores the full string for every row. 
For columns with repeated string values, this can significantly inflate 
individual Arrow IPC messages, potentially exceeding gRPC message size limits 
(4MB default in tonic).
   
   **Performance**: String operations are applied to every row instead of just 
the unique dictionary values.
   
   **Describe the solution you'd like**
   
   Make string scalar functions handle Dictionary-typed inputs natively, 
preserving the encoding:
   
   1. **`return_type`**: Return `Dictionary(K, V')` when input is 
`Dictionary(K, V)`, where `V'` is the appropriate result value type.
   2. **`invoke`/`invoke_with_args`**: When the input is a `DictionaryArray`, 
apply the string operation only to the dictionary's unique values array, then 
construct a new `DictionaryArray` with the transformed values and the original 
keys (index) array.
   
   Ideally this would be a reusable pattern (perhaps a helper or wrapper) that 
individual string functions can opt into, rather than duplicating the logic in 
every function.
   
   **Describe alternatives you've considered**
   
   - **Workaround**: Users can wrap results in `arrow_cast(some_string_fn(...), 
'Dictionary(Int32, Utf8)')` to re-dictionary-encode the output. However, this 
is non-obvious and still applies the operation to every row rather than just 
the unique values.
   - **Coerce Dictionary to Utf8View instead of Utf8**: The type coercion layer 
could coerce `Dictionary(K, Utf8)` to `Utf8View` rather than `Utf8`. Since 
string functions already handle `Utf8View` inputs and return `Utf8View` 
outputs, this would work without per-function changes. `Utf8View` is more 
compact than `Utf8` for repeated values in IPC serialization, though not as 
compact as Dictionary encoding.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to