gustavodemorais opened a new pull request, #27649:
URL: https://github.com/apache/flink/pull/27649

   We've been optimizing the engine to preserve upsert keys through CAST 
expressions. This PR explores extending that to BINARY/VARBINARY → CHAR/VARCHAR 
casts.
   
   The BYTES → STRING cast is not strictly injective: if the input contains 
invalid UTF-8 sequences, Java's new String(bytes, UTF_8) replaces the malformed 
bytes with U+FFFD, so distinct byte arrays can map to the same string. This 
PR adds tests that show this is a real, reproducible issue, not just a 
theoretical one:
   
   - CastFunctionMiscITCase - pure Flink SQL with x'...' hex literals 
demonstrating collisions through the full engine.
   - CastFunctionMiscITCase#testBinaryToStringCastCollisionFromTable - creates 
a values source table with distinct VARBINARY rows and shows CAST(payload AS 
STRING) produces identical output for both.
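
   For readers who want to see the collision outside the engine, here is a 
minimal plain-Java sketch. It is not the test code from this PR: the class name 
BinaryToStringCollision and the helper decode are hypothetical, and it assumes 
the cast behaves like new String(bytes, UTF_8) as described above. The bytes 
0xFF and 0xFE never appear in valid UTF-8, so both decode to a single U+FFFD.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of Flink or of this PR.
public class BinaryToStringCollision {

    // Assumption: mirrors the decoding step named in the description.
    static String decode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] a = {(byte) 0xFF}; // never valid in UTF-8
        byte[] b = {(byte) 0xFE}; // never valid in UTF-8

        // The inputs differ...
        System.out.println(Arrays.equals(a, b));            // false
        // ...but the decoded strings are identical ("\uFFFD").
        System.out.println(decode(a).equals(decode(b)));    // true
    }
}
```

   If a and b were upsert keys, the two rows would share one key after the 
cast.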
   
   
   Why it might be safe to treat as injective anyway: The cast is injective for 
all valid UTF-8 input. If the bytes are not valid UTF-8, the resulting string 
already contains U+FFFD replacement characters - the output is broken 
regardless of whether we track upsert keys through the cast. We are not making 
a broken pipeline worse; we are optimizing the common case where the binary 
data is well-formed UTF-8. 
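
   The "injective for valid UTF-8" claim can be checked with a round trip: 
for well-formed input, encoding the decoded string returns the original bytes, 
so two distinct valid inputs cannot produce the same string. The sketch below 
is plain Java under the same assumption about the cast's decoding; the class 
Utf8RoundTrip and its methods are hypothetical names, not Flink APIs.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class, not part of Flink or of this PR.
public class Utf8RoundTrip {

    // Strict decode: reports malformed input instead of substituting U+FFFD.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Lenient decode followed by encode; true when the bytes survive unchanged.
    static boolean roundTrips(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return Arrays.equals(bytes, s.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) {
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        byte[] invalid = {(byte) 0xFF};

        System.out.println(isValidUtf8(valid) + " " + roundTrips(valid));     // true true
        System.out.println(isValidUtf8(invalid) + " " + roundTrips(invalid)); // false false
    }
}
```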
   
   We have to decide whether to make this relatively "strong" assumption for 
the engine. I opened the PR so we can more easily visualize and discuss whether 
we want to proceed with the optimization.


