theirix commented on issue #12709:
URL: https://github.com/apache/datafusion/issues/12709#issuecomment-4194430343

   I suggest reopening this issue to discuss semantics again.
   
   The issue is that the logic was introduced for the N-ary `concat` UDF, but 
not to the binary pipe operator (`||`) - they return text but have entirely 
different code.  While `concat` allows invalid UTF-8 in binary inputs by 
deferring validation to the entire concatenated binary, the pipe operator 
doesn't do so. The optimiser tries to optimise this plan
   ```
   Projection: CAST(Binary("255") AS Utf8) || CAST(Binary("175") AS Utf8)
         EmptyRelation: rows=1
   ```
   and [fails trying to evaluate 
literals](https://github.com/apache/datafusion/blob/206ac6b950e709373a39e74ebb33ed0c95edce9e/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L550).
 The same error with `concat`. It doesn't happen when arguments are valid UTF-8 
binaries as per added SLTs - 
https://github.com/apache/datafusion/compare/main...theirix:datafusion:binary-concat-fixes
 . Of course, if we follow Postgres' behaviour with hexifying binary output, it 
won't happen.
   
   While I was summarising what other engines do, I discovered that Postgres is 
actually an outlier, and it handles binaries by emitting a hex representation, 
as pointed out above. This Postgres implementation for both UDF and operator 
emits binary as decoded text, not as hexified string `SELECT x'c3a9' || 
'hello'::text` -> `éhello` instead of `1100001110101001hello` with Postgres.
   
   Engine | str+str | str+bin | bin+bin
   -- | -- | -- | --
   PostgreSQL | text | Mixed text and hexified bytes | Binary (if both args are 
bytea) or hexified str
   ClickHouse | String of bytes | String of bytes | String of bytes
   Spark/Databricks | String | Type error (must cast) | Binary
   DuckDB | String | Type error (must cast) | Binary
   Snowflake | String | Type error (must cast) | Binary
   MySQL | String | Binary string | Binary string
   
   Now I'm wondering if we should provide bin+bin -> bin logic by changing 
`concat` UDF and providing a totally new `BinaryConcat` operator instead of 
`StringConcat` when both operands are binaries. It depends on the choice 
between Postgres and various AnalyticalDBs.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to