Re: Re: [DISCUSS] FLIP-568: Strict BYTES-to-STRING CAST with UTF-8 Validation Utilities

Gustavo de Morais Fri, 20 Mar 2026 12:09:04 -0700

Hi Xuyang and Timo,

Thanks for the positive feedback! Regarding your suggestions, Xuyang:


1. Yes, good point - we should add the fromUtf8Bytes(byte[], int, int)
overload as well.
2. This is also relevant. If we validate at ingestion time, this might have
performance implications. Since these are @Internal APIs, they can be
changed independently of the FLIP afterwards if it makes sense. What's your
opinion?
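
To make the difference concrete, here is a minimal JDK-only sketch of lenient
versus strict decoding over a (byte[], int, int) range. The class and method
names (StrictUtf8Sketch, lenientDecode, strictDecode) are hypothetical and only
for illustration; this is not Flink's StringData API or the FLIP's actual
implementation.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration only; not part of Flink.
public class StrictUtf8Sketch {

    /** Lenient decoding: malformed bytes are silently replaced with U+FFFD. */
    public static String lenientDecode(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    /** Strict decoding: malformed bytes raise a CharacterCodingException. */
    public static String strictDecode(byte[] bytes, int offset, int length)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, offset, length))
                .toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {(byte) 'a', (byte) 0xFF, (byte) 'b'};

        // Lenient path: decode, then encode again and compare with the input.
        String lenient = lenientDecode(invalid, 0, invalid.length);
        byte[] roundTrip = lenient.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip preserved bytes: "
                + Arrays.equals(invalid, roundTrip));

        // Strict path: the same input is rejected instead of being altered.
        try {
            strictDecode(invalid, 0, invalid.length);
            System.out.println("strict decode accepted the input");
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected the input: " + e);
        }
    }
}
```

The offset/length shape lets a caller validate a slice of a larger buffer
without copying it first, which is presumably where the extra overload helps.
The strict path does more work per value than a plain wrap of the bytes, which
is the performance question for the ingestion call sites.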

Kind regards,
Gustavo

On Fri, 20 Mar 2026 at 08:44, Xuyang <[email protected]> wrote:

> Hi, Gustavo.
> Great catch! Thanks for driving this FLIP. Overall LGTM. I just have two
> minor points I'd like to confirm with you.
> 1. Should we also add the overload function `fromUtf8Bytes(byte[], int,
> int)` in StringData?
> 2. Callers like `ColumnarRowData#getString` and
> `ColumnarArrayData#getString` call `StringData.fromBytes` directly. Should
> these call sites be migrated in a follow-up, or intentionally left as-is?
>
>
>
>
>
> --
>
>     Best!
>     Xuyang
>
>
>
> At 2026-03-19 22:37:28, "Timo Walther" <[email protected]> wrote:
> >Hi Gustavo,
> >
> >thank you for this excellent design document. And thanks for discovering
> >this data loss and driving the investigation. We should definitely fix
> >this shortcoming. Also looking at other vendors, it is definitely a cause
> >for false assumptions that lead to hard-to-debug inconsistencies.
> >
> >+1 for this proposal.
> >
> >Cheers,
> >Timo
> >
> >
> >On 19.03.26 15:23, Gustavo de Morais wrote:
> >> Hi everyone,
> >>
> >> Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
> >> with U+FFFD (?). The substitution is irreversible and produces no warning -
> >> the pipeline keeps running while data is permanently corrupted
> >> downstream. This also means that a CAST from BYTES → STRING → BYTES is not
> >> idempotent, which prevents the engine from applying certain optimizations,
> >> for example preserving upsert keys after such CASTs.
> >>
> >> I'd like to start a discussion around defining and improving the default
> >> behavior. I've written a short FLIP [1] proposing new utility functions to
> >> handle this explicitly - similar to what other engines like Spark already
> >> do - and changing the default behavior to throw an error instead of
> >> silently corrupting data, while giving users clear options to deal with
> >> invalid bytes.
> >>
> >> Looking forward to your feedback and thoughts.
> >>
> >> Kind regards,
> >> Gustavo
> >>
> >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
> >>
>
