theirix commented on issue #12709:
URL: https://github.com/apache/datafusion/issues/12709#issuecomment-4275915605

   > I suggest reopening this issue to discuss semantics again.
   > 
   > The issue is that the logic was introduced for the N-ary `concat` UDF, but 
not to the binary pipe operator (`||`) - they return text but have entirely 
different code. While `concat` allows invalid UTF-8 in binary inputs by 
deferring validation to the entire concatenated binary, the pipe operator 
doesn't do so. The optimiser tries to optimise this plan
   > 
   > ```
   > Projection: CAST(Binary("255") AS Utf8) || CAST(Binary("175") AS Utf8)
   >       EmptyRelation: rows=1
   > ```
   > 
   > and [fails trying to evaluate 
literals](https://github.com/apache/datafusion/blob/206ac6b950e709373a39e74ebb33ed0c95edce9e/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L550).
 The same error with `concat`. It doesn't happen when arguments are valid UTF-8 
binaries as per added SLTs - 
[main...theirix:datafusion:binary-concat-fixes](https://github.com/apache/datafusion/compare/main...theirix:datafusion:binary-concat-fixes)
 . Of course, if we follow Postgres' behaviour with hexifying binary output, it 
won't happen.
   > 
   > While I was summarising what other engines do, I discovered that Postgres 
is actually an outlier, and it handles binaries by emitting a hex 
representation, as pointed out above. This Postgres implementation for both UDF 
and operator emits binary as decoded text, not as hexified string `SELECT 
x'c3a9' || 'hello'::text` -> `éhello` instead of `1100001110101001hello` with 
Postgres.
   > Engine     str+str         str+bin         bin+bin
   > PostgreSQL         text    Mixed text and hexified bytes   Binary (if both 
args are bytea) or hexified str
   > ClickHouse         String of bytes         String of bytes         String 
of bytes
   > Spark/Databricks   String  Type error (must cast)  Binary
   > DuckDB     String  Type error (must cast)  Binary
   > Snowflake  String  Type error (must cast)  Binary
   > MySQL      String  Binary string   Binary string
   > 
   > Now I'm wondering if we should provide bin+bin -> bin logic by changing 
`concat` UDF and providing a totally new `BinaryConcat` operator instead of 
`StringConcat` when both operands are binaries. It depends on the choice 
between Postgres and various AnalyticalDBs.
   
   @Jefffrey @alamb @jayzhan211 I would like to ask for your opinion on these 
findings and on which path we should follow.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to