theirix commented on issue #12709:
URL: https://github.com/apache/datafusion/issues/12709#issuecomment-4275915605
> I suggest reopening this issue to discuss semantics again.
>
> The issue is that the logic was introduced for the N-ary `concat` UDF, but
not to the binary pipe operator (`||`) - they return text but have entirely
different code. While `concat` allows invalid UTF-8 in binary inputs by
deferring validation to the entire concatenated binary, the pipe operator
doesn't do so. The optimiser tries to optimise this plan
>
> ```
> Projection: CAST(Binary("255") AS Utf8) || CAST(Binary("175") AS Utf8)
> EmptyRelation: rows=1
> ```
>
> and [fails trying to evaluate
literals](https://github.com/apache/datafusion/blob/206ac6b950e709373a39e74ebb33ed0c95edce9e/datafusion/optimizer/src/simplify_expressions/expr_simplifier.rs#L550).
The same error with `concat`. It doesn't happen when arguments are valid UTF-8
binaries as per added SLTs -
[main...theirix:datafusion:binary-concat-fixes](https://github.com/apache/datafusion/compare/main...theirix:datafusion:binary-concat-fixes)
. Of course, if we follow Postgres' behaviour with hexifying binary output, it
won't happen.
>
> While I was summarising what other engines do, I discovered that Postgres
is actually an outlier, and it handles binaries by emitting a hex
representation, as pointed out above. This Postgres implementation for both UDF
and operator emits binary as decoded text, not as hexified string `SELECT
x'c3a9' || 'hello'::text` -> `éhello` instead of `1100001110101001hello` with
Postgres.
> Engine str+str str+bin bin+bin
> PostgreSQL text Mixed text and hexified bytes Binary (if both
args are bytea) or hexified str
> ClickHouse String of bytes String of bytes String
of bytes
> Spark/Databricks String Type error (must cast) Binary
> DuckDB String Type error (must cast) Binary
> Snowflake String Type error (must cast) Binary
> MySQL String Binary string Binary string
>
> Now I'm wondering if we should provide bin+bin -> bin logic by changing
`concat` UDF and providing a totally new `BinaryConcat` operator instead of
`StringConcat` when both operands are binaries. It depends on the choice
between Postgres and various AnalyticalDBs.
@Jefffrey @alamb @jayzhan211 I would like to ask for your opinion on these
findings and on which path we should follow.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]