gustavodemorais opened a new pull request, #27649: URL: https://github.com/apache/flink/pull/27649
We've been optimizing the engine to preserve upsert keys through CAST expressions. This PR explores extending that to BINARY/VARBINARY → CHAR/VARCHAR casts.

The BYTES → STRING cast is not strictly injective: if the input contains invalid UTF-8 sequences, Java's new String(bytes, UTF_8) replaces each invalid byte with U+FFFD, so distinct byte arrays can map to the same string.

This PR adds tests that prove this is a real, reproducible issue, not just a theoretical one:

- CastFunctionMiscITCase - pure Flink SQL with x'...' hex literals demonstrating collisions through the full engine.
- CastFunctionMiscITCase#testBinaryToStringCastCollisionFromTable - creates a values source table with distinct VARBINARY rows and shows that CAST(payload AS STRING) produces identical output for both.

Why it might be safe to treat as injective anyway:

The cast is injective for all valid UTF-8 input. If the bytes are not valid UTF-8, the resulting string already contains U+FFFD replacement characters, so the output is broken regardless of whether we track upsert keys through the cast. We are not making a broken pipeline worse; we are optimizing the common case where the binary data is well-formed UTF-8.

We have to decide whether to make this relatively "strong" assumption for the engine. I opened the PR so we can more easily visualize and discuss whether we want to proceed with the optimization.
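
For readers who want to see the collision outside of Flink, here is a minimal standalone sketch in plain Java. It is not taken from the PR's tests: the class name BinaryToStringCollision and the helper decode are hypothetical, and the assumption is that the decoding behaves like the new String(bytes, UTF_8) call described above. The single-byte inputs 0x80 and 0xFF are chosen because neither is valid UTF-8 on its own.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Flink or of this PR.
final class BinaryToStringCollision {

    // Decodes bytes as UTF-8. Malformed input is replaced with U+FFFD
    // instead of raising an error, which is what makes collisions possible.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0x80}; // lone continuation byte, invalid UTF-8
        byte[] b = {(byte) 0xFF}; // byte that never appears in valid UTF-8

        System.out.println("inputs equal:  " + Arrays.equals(a, b));
        System.out.println("outputs equal: " + decode(a).equals(decode(b)));

        // Valid UTF-8 survives a round trip, so the cast stays injective
        // on well-formed input.
        byte[] valid = "héllo".getBytes(StandardCharsets.UTF_8);
        byte[] roundTrip = decode(valid).getBytes(StandardCharsets.UTF_8);
        System.out.println("valid round trip: " + Arrays.equals(valid, roundTrip));
    }
}
```

Two distinct inputs decode to the same one-character string, while the valid input is recovered exactly. That is the trade-off under discussion: the key is preserved for well-formed data and can collapse only for data that was already malformed.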
