Gustavo de Morais created FLINK-39600:
-----------------------------------------

             Summary: FLIP-568: Strict BYTES-to-STRING CAST with UTF-8 Validation Utilities
                 Key: FLINK-39600
                 URL: https://issues.apache.org/jira/browse/FLINK-39600
             Project: Flink
          Issue Type: New Feature
          Components: Table SQL / API
            Reporter: Gustavo de Morais
            Assignee: Gustavo de Morais


{{CAST(bytes AS STRING)}} currently replaces every invalid UTF-8 sequence with 
the Unicode replacement character {{U+FFFD}} without raising an error. The 
substitution is irreversible and produces no warning: pipelines keep running 
while the data is permanently corrupted downstream. Because distinct invalid 
inputs can map to the same string, the cast is not injective, which also blocks 
engine optimizations that rely on injectivity (e.g. upsert key propagation 
through {{BINARY -> STRING}}).
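
To make the loss of injectivity concrete, here is a minimal sketch using only the JDK's lenient decoder. It is an illustration of replacement-style decoding in general, not Flink's actual cast code path; the class and method names ({{LossyDecodeDemo}}, {{lossyDecode}}) are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; plain JDK only, not Flink code.
final class LossyDecodeDemo {

    // Lenient decoding: malformed input becomes U+FFFD and no exception is thrown.
    static String lossyDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // never valid in UTF-8
        byte[] b = {(byte) 0xFE}; // never valid in UTF-8

        String sa = lossyDecode(a);
        String sb = lossyDecode(b);

        System.out.println(Arrays.equals(a, b)); // false: the inputs differ
        System.out.println(sa.equals(sb));       // true: both decode to "\uFFFD"

        // The original bytes cannot be recovered from the decoded string.
        byte[] roundTrip = sa.getBytes(StandardCharsets.UTF_8);
        System.out.println(Arrays.equals(a, roundTrip)); // false
    }
}
```

Two distinct byte arrays collapse to the same string, so a key that was unique as {{BYTES}} is no longer guaranteed to be unique after the cast.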

[FLIP-568|https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities]
 addresses this by:
 # Making {{CAST(bytes AS STRING)}} throw on invalid UTF-8, while {{TRY_CAST}} 
returns {{NULL}} instead. A migration flag restores the legacy behavior.
 # Adding two SQL functions:
 ** {{IS_VALID_UTF8(bytes) -> BOOLEAN}} for routing invalid records to a 
dead-letter sink
 ** {{MAKE_VALID_UTF8(bytes) -> STRING}} as the explicit, opt-in substitution 
recipe
 # Adding a {{StringData.fromUtf8Bytes(byte[])}} connector API that validates 
at ingestion and throws on invalid input.
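
The semantics listed above can be approximated with the JDK's {{CharsetDecoder}}. This is a rough sketch under stated assumptions, not the FLIP's implementation: the class {{Utf8Sketch}} and its methods are hypothetical stand-ins for strict {{CAST}}, {{TRY_CAST}}, {{IS_VALID_UTF8}} and {{MAKE_VALID_UTF8}}, and the real functions may differ in null handling, error types and performance.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class; names are illustrative, not Flink APIs.
final class Utf8Sketch {

    // Strict decode (CAST-like): throws on malformed UTF-8.
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // TRY_CAST-like: null instead of an exception.
    static String tryDecode(byte[] bytes) {
        try {
            return strictDecode(bytes);
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    // IS_VALID_UTF8-like: a predicate usable for dead-letter routing.
    static boolean isValidUtf8(byte[] bytes) {
        return tryDecode(bytes) != null;
    }

    // MAKE_VALID_UTF8-like: explicit, opt-in substitution with U+FFFD.
    static String makeValidUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] good = "hello".getBytes(StandardCharsets.UTF_8);
        byte[] bad = {0x68, (byte) 0xFF}; // 'h' followed by an invalid byte

        System.out.println(isValidUtf8(good)); // true
        System.out.println(isValidUtf8(bad));  // false
        System.out.println(tryDecode(bad));    // null
        System.out.println(makeValidUtf8(bad).equals("h\uFFFD")); // true
    }
}
```

The point of the split is that substitution still exists, but only where the caller asks for it by name; the default path either succeeds losslessly or fails visibly.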



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
