[PR] feat: support binary and string types for concat UDFs [datafusion]

via GitHub Sat, 16 May 2026 02:16:18 -0700


theirix opened a new pull request, #22244:
URL: https://github.com/apache/datafusion/pull/22244


   ## Which issue does this PR close?
   
   - Closes #12709.
   
   ## Rationale for this change
   
   While #21883 introduced binary argument support for the pipe operator, this 
PR targets three UDFs: `concat`, `concat_ws`, and Spark's `concat` to harmonise 
all their behaviour.
   
   We decided to support only string+string and binary+binary and ban mixed 
operations, to match [the behaviour of the majority of 
engines](https://github.com/apache/datafusion/issues/12709#issuecomment-4194430343).
 Previously, mixed behaviour was allowed in #20787.
   
   ## What changes are included in this PR?
   
   - Added support for binary types (four variants)
   - Disallowed mixed string/binary operations
   - Fixed edge cases when binary->string type coercion overrode UDF rules, so 
mixed calls were allowed
   - Refactored the three UDFs by extracting duplicate code into shared helpers 
to keep the logic centralised
   - Refined `concat_ws` behavior for different separator types
   
   The diff is quite large. Detailed code changes:
   - Added a trait `ConcatBuilder` to abstract string/binary/view/array 
operations
   - Abstracted `StringArray`/`LargeStringArray` into a generic 
`ConcatGenericStringBuilder` - less duplication
   - Introduced mirrored builders for binary types in a new file `binaries.rs` 
   - Introduced more `ColumnarValueRef` variants to handle nullable and 
non-nullable binary types
   - Extracted the ColumnarValue -> ColumnarValueRef builder to 
`from_columnar_value` - simplified call sites. Scalar code path stays mostly 
the same
   - Switched from `Signature::variadic` to `Signature::UserDefined` to allow 
different argument types.`Variadic` required every argument (including 
binaries) to be coerced to the same string type, so the UDF cannot distinguish 
between binary and string inputs. It's a relatively uncommon - happy to discuss
   - Simplified Spark's concat significantly by reusing the `concat` 
implementation, so it handles only Spark-specific null-handling
   - Moved SLTs from Spark SLT (`spark/concat.slt`) to a generic SLT, so we 
test both generic and Spark-specific behaviour
   
   
   ## Are these changes tested?
   
   - Added more unit tests for previously uncovered major code paths
   - Added more SLTs, especially for type coercion
   
   ## Are there any user-facing changes?
   
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] feat: support binary and string types for concat UDFs [datafusion]

Reply via email to