Hi everyone, Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte with U+FFFD (?). The substitution is irreversible and produces no warning - the pipeline keeps running while data is permanently corrupted downstream. This also means that a CAST from BYTES → STRING → BYTES is not idempotent, which prevents the engine from applying certain optimizations. For example, for preserving upsert keys after such CASTs.
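To make the problem concrete, here is a minimal sketch in plain Java that reproduces the same kind of lossy substitution with the JDK's own UTF-8 decoding. It is only an illustration: the class and method names (LossyUtf8Demo, lenientDecode, strictDecode, roundTrips) are made up for this example, and I am assuming the JDK's lenient decoder is a fair stand-in for what the CAST does today, not quoting the actual Flink code path.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of Flink.
class LossyUtf8Demo {

    // Lenient decode: malformed input is replaced with U+FFFD and no error is raised.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decode: malformed input raises CharacterCodingException instead.
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes))
                .toString();
    }

    // True only if bytes -> string -> bytes returns the original bytes.
    static boolean roundTrips(byte[] bytes) {
        byte[] back = lenientDecode(bytes).getBytes(StandardCharsets.UTF_8);
        return Arrays.equals(bytes, back);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {'a', (byte) 0xFF, 'b'};

        String lenient = lenientDecode(invalid);
        byte[] back = lenient.getBytes(StandardCharsets.UTF_8);

        // The single invalid byte became the 3-byte encoding of U+FFFD.
        System.out.println("original length:   " + invalid.length); // 3
        System.out.println("round-trip length: " + back.length);    // 5
        System.out.println("round trip intact: " + roundTrips(invalid)); // false

        try {
            strictDecode(invalid);
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }
    }
}
```

With the lenient decoder the 3 input bytes come back as 5 bytes, so the original value is gone and nothing has signalled it. The strict decoder rejects the same input up front.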

I'd like to start a discussion about defining and improving the default behavior. I've written a short FLIP [1] that proposes new utility functions for handling invalid bytes explicitly, similar to what other engines such as Spark already offer. It also proposes changing the default behavior to throw an error instead of silently corrupting data, while giving users clear options for dealing with invalid bytes.

Looking forward to your feedback and thoughts.

Kind regards,
Gustavo

[1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
