Hi everyone,

Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
with U+FFFD (?). The substitution is irreversible and produces no warning -
the pipeline keeps running while data is permanently corrupted
downstream. This also means that a CAST from BYTES → STRING → BYTES is not
idempotent, which prevents the engine from applying certain optimizations.
For example, for preserving upsert keys after such CASTs.
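
To make the problem concrete, here is a minimal sketch of the lossy round
trip. It is plain Python, not Flink code: Python's "replace" error handler
is used as a stand-in for the current lenient CAST, and the byte values are
made up for illustration.

```python
# Illustration only: errors="replace" stands in for the lenient
# CAST(bytes AS STRING) behavior; this is not Flink's implementation.
original = b"key-\xff-1"   # 0xFF can never occur in valid UTF-8

as_string = original.decode("utf-8", errors="replace")
round_tripped = as_string.encode("utf-8")

print(as_string == "key-\ufffd-1")   # the invalid byte became U+FFFD
print(round_tripped == original)     # the original bytes are gone

# A different invalid byte maps to the same string, so two distinct
# BYTES values can collide after the cast.
other = b"key-\xfe-1"
collides = other.decode("utf-8", errors="replace") == as_string
print(collides)
```

Because distinct inputs can map to the same string, the cast is not
injective, which is one way to see why a key cannot safely be assumed to
survive it.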

I'd like to start a discussion around defining and improving the default
behavior. I've written a short FLIP [1] proposing new utility functions to
handle this explicitly - similar to what other engines like Spark already
do - and changing the default behavior to throw an error instead of
silently corrupting data, while giving users clear options to deal with
invalid bytes.
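
As a rough sketch of what "fail by default, with explicit escape hatches"
could look like, the following Python mimics the three kinds of behavior.
The helper names (strict_cast, is_valid_utf8, try_cast, lenient_cast) are
hypothetical and chosen only for this example; they are not the function
names proposed in the FLIP.

```python
from typing import Optional


def strict_cast(data: bytes) -> str:
    """Hypothetical strict default: raise on any invalid UTF-8."""
    return data.decode("utf-8")  # errors="strict" is Python's default


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical validation helper: check without converting."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def try_cast(data: bytes) -> Optional[str]:
    """Hypothetical null-on-failure variant."""
    return data.decode("utf-8") if is_valid_utf8(data) else None


def lenient_cast(data: bytes) -> str:
    """Hypothetical explicit opt-in to today's U+FFFD substitution."""
    return data.decode("utf-8", errors="replace")


good, bad = "caf\u00e9".encode("utf-8"), b"caf\xe9"  # second is Latin-1
print(strict_cast(good), is_valid_utf8(bad), try_cast(bad))
print(lenient_cast(bad))
```

The point of the split is that data loss only happens when the user asks
for it by name, instead of being the silent default.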

Looking forward to your feedback and thoughts.

Kind regards,
Gustavo

[1]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
