theirix commented on issue #12709:
URL: https://github.com/apache/datafusion/issues/12709#issuecomment-4194430343
I suggest reopening this issue to discuss semantics again.
The issue is that the logic was introduced for the N-ary `concat` UDF, but
not to the binary pipe operator (`||`) - they return text but have entirely
different code. While `concat` allows invalid UTF-8 in binary inputs by
deferring validation to the entire concatenated binary, the pipe operator
doesn't do so. The optimiser tries to optimise this plan
```
Projection: CAST(Binary("255") AS Utf8) || CAST(Binary("175") AS Utf8)
EmptyRelation: rows=1
```
and [fails trying to evaluate
literals](https://github.com/apache/datafusion/blob/206ac6b950e709373a39e74ebb33ed0c95edce9e/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L550).
The same error with `concat`. It doesn't happen when arguments are valid UTF-8
binaries as per added SLTs -
https://github.com/apache/datafusion/compare/main...theirix:datafusion:binary-concat-fixes
. Of course, if we follow Postgres' behaviour with hexifying binary output, it
won't happen.
While I was summarising what other engines do, I discovered that Postgres is
actually an outlier, and it handles binaries by emitting a hex representation,
as pointed out above. This Postgres implementation for both UDF and operator
emits binary as decoded text, not as hexified string `SELECT x'c3a9' ||
'hello'::text` -> `éhello` instead of `1100001110101001hello` with Postgres.
Engine | str+str | str+bin | bin+bin
-- | -- | -- | --
PostgreSQL | text | Mixed text and hexified bytes | Binary (if both args are
bytea) or hexified str
ClickHouse | String of bytes | String of bytes | String of bytes
Spark/Databricks | String | Type error (must cast) | Binary
DuckDB | String | Type error (must cast) | Binary
Snowflake | String | Type error (must cast) | Binary
MySQL | String | Binary string | Binary string
Now I'm wondering if we should provide bin+bin -> bin logic by changing
`concat` UDF and providing a totally new `BinaryConcat` operator instead of
`StringConcat` when both operands are binaries. It depends on the choice
between Postgres and various AnalyticalDBs.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]