Gustavo de Morais created FLINK-39600:
-----------------------------------------
Summary: FLIP-568: Strict BYTES-to-STRING CAST with UTF-8
Validation Utilities
Key: FLINK-39600
URL: https://issues.apache.org/jira/browse/FLINK-39600
Project: Flink
Issue Type: New Feature
Components: Table SQL / API
Reporter: Gustavo de Morais
Assignee: Gustavo de Morais
{{CAST(bytes AS STRING)}} today silently replaces invalid UTF-8 byte sequences
with the Unicode replacement character {{U+FFFD}}. The substitution is
irreversible and produces no warning, so pipelines keep running while data is
permanently corrupted downstream. It also blocks engine optimizations that
require the cast to be injective (e.g. upsert key propagation through
{{BINARY -> STRING}}).
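To illustrate why the lenient behavior is lossy and not injective, here is a minimal plain-Java sketch. It does not use Flink code: the class and method names are hypothetical, and it assumes only that the JDK's {{new String(bytes, UTF_8)}} performs the same kind of {{U+FFFD}} substitution described above. Several distinct byte arrays decode to the same string, so the original bytes cannot be recovered.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of Flink.
final class LossyDecodeDemo {

    // Lenient decoding: the JDK replaces malformed input with U+FFFD.
    static String decodeLenient(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] invalidA = {(byte) 0xFF};                           // never valid in UTF-8
        byte[] invalidB = {(byte) 0xFE};                           // never valid in UTF-8
        byte[] validFffd = {(byte) 0xEF, (byte) 0xBF, (byte) 0xBD}; // valid encoding of U+FFFD

        String a = decodeLenient(invalidA);
        String b = decodeLenient(invalidB);
        String c = decodeLenient(validFffd);

        // Three different inputs collapse to one output, so the mapping is not injective.
        System.out.println(a.equals(b) && b.equals(c));
    }
}
```

If such a column were used as an upsert key, the two invalid inputs above would collide after the cast even though the underlying bytes differ.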
[FLIP-568|https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities]
addresses this by:
# Making {{CAST(bytes AS STRING)}} throw on invalid UTF-8, while {{TRY_CAST}}
returns {{NULL}} instead. A migration flag restores the legacy behavior.
# Adding two SQL functions:
** {{IS_VALID_UTF8(bytes) -> BOOLEAN}} for routing invalid records to a
dead-letter sink
** {{MAKE_VALID_UTF8(bytes) -> STRING}} as the explicit, opt-in substitution
recipe
# Adding a {{StringData.fromUtf8Bytes(byte[])}} connector API that validates at
ingestion and throws on invalid input.
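The semantics listed above can be sketched in plain Java using the JDK's {{CharsetDecoder}} with {{CodingErrorAction.REPORT}}. This is only an illustration under stated assumptions, not the FLIP's implementation: the class {{StrictUtf8Sketch}} and its methods are hypothetical names chosen to mirror {{CAST}}, {{TRY_CAST}}, {{IS_VALID_UTF8}} and {{MAKE_VALID_UTF8}}, and the exact substitution rule of {{MAKE_VALID_UTF8}} is defined by the FLIP rather than by the JDK behavior used here.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch, not Flink code.
final class StrictUtf8Sketch {

    // Strict decode (analogous to the proposed CAST): throws on invalid UTF-8.
    static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Analogous to TRY_CAST: returns null instead of throwing.
    static String tryDecode(byte[] bytes) {
        try {
            return decodeStrict(bytes);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // Analogous to IS_VALID_UTF8: true only if the whole input decodes cleanly.
    static boolean isValidUtf8(byte[] bytes) {
        return tryDecode(bytes) != null;
    }

    // Analogous to MAKE_VALID_UTF8 (assumption: JDK-style U+FFFD substitution).
    static String makeValid(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] good = "ok".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {'a', (byte) 0xFF};

        System.out.println(isValidUtf8(good)); // valid input passes validation
        System.out.println(isValidUtf8(bad));  // candidate for a dead-letter sink
        System.out.println(tryDecode(bad));    // null rather than a corrupted string
        System.out.println(makeValid(bad));    // explicit, opt-in substitution
    }
}
```

The point of the split is that substitution becomes an explicit choice: callers either fail fast, get a null they must handle, or opt into replacement by name.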