[ 
https://issues.apache.org/jira/browse/FLINK-39125?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gustavo de Morais updated FLINK-39125:
--------------------------------------
    Description: 
When users cast a VARBINARY key column to VARCHAR, the upsert key uniqueness is 
lost because the cast is not recognized as injective.

UTF-8 decoding is itself injective when the input is valid UTF-8 - distinct 
byte sequences always produce distinct strings - so we can safely mark these 
casts as injective when the string target has sufficient capacity. The cast is 
injective under the following conditions:
 * {{VARBINARY(MAX) → VARCHAR(MAX)}}: both sides are unbounded
 * {{VARBINARY(n) → VARCHAR(m)}} where {{m >= n}}: every UTF-8 character 
occupies at least 1 byte, so {{n}} bytes always decode to at most {{n}} 
characters
 * Bounded source to unbounded ({{MAX}}) target: always fits

This applies to all four cross-family combinations: 
{{BINARY}}/{{VARBINARY}} to {{CHAR}}/{{VARCHAR}}.
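
The capacity conditions above can be sketched as a small predicate. This is an 
illustration only, not planner code: the function name 
binary_to_string_cast_fits is hypothetical, None stands in for MAX, and the 
unbounded-source-to-bounded-target case, which the list does not mention, is 
assumed here to not fit. It covers capacity only and says nothing about 
whether the input bytes are valid UTF-8.

```python
from typing import Optional


def binary_to_string_cast_fits(source_len: Optional[int],
                               target_len: Optional[int]) -> bool:
    """Hypothetical helper: capacity rule only. None models MAX (unbounded)."""
    if target_len is None:
        # Unbounded target: any source, bounded or not, always fits.
        return True
    if source_len is None:
        # Assumption (not stated in the issue): an unbounded source cannot be
        # guaranteed to fit into a bounded target.
        return False
    # n bytes decode to at most n characters, so m >= n is sufficient.
    return target_len >= source_len


# Multi-byte characters use more bytes than characters, never fewer.
encoded = "\u00e9\u20ac".encode("utf-8")       # 2 bytes + 3 bytes = 5 bytes
decoded_chars = len(encoded.decode("utf-8"))   # 2 characters
```

For example, binary_to_string_cast_fits(10, 10) holds, while 
binary_to_string_cast_fits(10, 9) does not.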

*Blocker:* This is currently not safe to implement. Flink's 
{{CAST(bytes AS STRING)}} silently replaces invalid UTF-8 byte sequences with 
the Unicode replacement character {{U+FFFD}} ({{\uFFFD}}), making the cast 
non-injective - two distinct byte arrays can produce the same string. This must 
be fixed first.
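
The collision can be reproduced with any lenient UTF-8 decoder. The sketch 
below uses Python's built-in decoder as a stand-in, on the assumption that it 
mirrors the replacement behavior described above; it is not Flink's cast 
implementation, and the helper names lenient_decode and strict_decode are 
hypothetical.

```python
def lenient_decode(data: bytes) -> str:
    # Invalid sequences are replaced with U+FFFD instead of raising.
    return data.decode("utf-8", errors="replace")


def strict_decode(data: bytes) -> str:
    # Invalid sequences raise UnicodeDecodeError.
    return data.decode("utf-8", errors="strict")


first = b"\xff"    # not valid UTF-8
second = b"\xfe"   # not valid UTF-8, and different from `first`

# Two distinct byte arrays collapse to the same string under lenient decoding.
collision = lenient_decode(first) == lenient_decode(second)

# Strict decoding rejects the input instead of silently mapping it to U+FFFD.
try:
    strict_decode(first)
    strict_rejects = False
except UnicodeDecodeError:
    strict_rejects = True
```

Here collision is True, so two distinct keys would map to one string, while 
the strict variant fails on invalid input rather than merging keys.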

  was:
When users cast a VARCHAR key column to VARBINARY, the upsert key uniqueness is 
lost because the cast is not recognized as injective.

UTF-8 encoding is itself injective - distinct strings always produce distinct 
byte sequences - so we can safely mark these casts as injective when the binary 
target has sufficient capacity. The cast is injective under the following 
conditions:
 * VARCHAR(MAX) → VARBINARY(MAX): both sides are unbounded

 * VARCHAR(x) → VARBINARY(y) where y >= x * 4: target can hold the worst-case 
UTF-8 encoding (4 bytes per character)

 * Bounded source to unbounded (MAX) target: always fits

This applies to all four cross-family combinations: CHAR/VARCHAR to 
BINARY/VARBINARY.


> Support injective casts from BINARY/VARBINARY to CHAR/VARCHAR for upsert key 
> preservation
> -----------------------------------------------------------------------------------------
>
>                 Key: FLINK-39125
>                 URL: https://issues.apache.org/jira/browse/FLINK-39125
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table SQL / Planner
>    Affects Versions: 2.2.0
>            Reporter: Gustavo de Morais
>            Assignee: Gustavo de Morais
>            Priority: Major
>             Fix For: 2.3.0
>
>
> When users cast a VARBINARY key column to VARCHAR, the upsert key uniqueness 
> is lost because the cast is not recognized as injective.
>
> UTF-8 decoding is itself injective when the input is valid UTF-8 - distinct 
> byte sequences always produce distinct strings - so we can safely mark these 
> casts as injective when the string target has sufficient capacity. The cast 
> is injective under the following conditions:
>  * {{VARBINARY(MAX) → VARCHAR(MAX)}}: both sides are unbounded
>  * {{VARBINARY(n) → VARCHAR(m)}} where {{m >= n}}: every UTF-8 character 
> occupies at least 1 byte, so {{n}} bytes always decode to at most {{n}} 
> characters
>  * Bounded source to unbounded ({{MAX}}) target: always fits
>
> This applies to all four cross-family combinations: 
> {{BINARY}}/{{VARBINARY}} to {{CHAR}}/{{VARCHAR}}.
>
> *Blocker:* This is currently not safe to implement. Flink's 
> {{CAST(bytes AS STRING)}} silently replaces invalid UTF-8 byte sequences with 
> the Unicode replacement character {{U+FFFD}} ({{\uFFFD}}), making the cast 
> non-injective - two distinct byte arrays can produce the same string. This 
> must be fixed first.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
