Hi Xuyang and Timo,

Thanks for the positive feedback! Regarding your suggestions, Xuyang:

1. Yes, good point - we should add the fromUtf8Bytes(byte[], int, int)
overload as well.
2. This is also relevant. If we want to do validation at ingestion time,
this might have performance implications. Since these are @Internal APIs,
they can be changed independently of the FLIP afterwards if it makes sense.
What's your opinion?

Kind regards,
Gustavo

On Fri, 20 Mar 2026 at 08:44, Xuyang <[email protected]> wrote:

> Hi, Gustavo.
> Great catch! Thanks for driving this FLIP. Overall LGTM. I just have two
> minor points I'd like to confirm with you.
> 1. Should we also add the overload function `fromUtf8Bytes(byte[], int,
> int)` in StringData?
> 2. Callers like `ColumnarRowData#getString` and
> `ColumnarArrayData#getString` call `StringData.fromBytes` directly. Should
> these call sites be migrated in a follow-up, or intentionally left as-is?
>
> --
>
> Best!
> Xuyang
>
> At 2026-03-19 22:37:28, "Timo Walther" <[email protected]> wrote:
> >Hi Gustavo,
> >
> >thank you for this excellent design document. And thanks for discovering
> >this data loss and driving the investigation. We should definitely fix
> >this shortcoming. Also looking at other vendors, it is definitely a cause
> >for false assumptions that lead to hard-to-debug inconsistencies.
> >
> >+1 for this proposal.
> >
> >Cheers,
> >Timo
> >
> >
> >On 19.03.26 15:23, Gustavo de Morais wrote:
> >> Hi everyone,
> >>
> >> Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
> >> with U+FFFD (?). The substitution is irreversible and produces no warning -
> >> the pipeline keeps running while data is permanently corrupted
> >> downstream. This also means that a CAST from BYTES → STRING → BYTES is not
> >> idempotent, which prevents the engine from applying certain optimizations.
> >> For example, for preserving upsert keys after such CASTs.
> >>
> >> I'd like to start a discussion around defining and improving the default
> >> behavior. I've written a short FLIP [1] proposing new utility functions to
> >> handle this explicitly - similar to what other engines like Spark already
> >> do - and changing the default behavior to throw an error instead of
> >> silently corrupting data, while giving users clear options to deal with
> >> invalid bytes.
> >>
> >> Looking forward to your feedback and thoughts.
> >>
> >> Kind regards,
> >> Gustavo
> >>
> >> [1]
> >> https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
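
P.S. To make the lossy round trip discussed in this thread concrete, here is a minimal sketch using only JDK classes. It does not touch Flink's StringData and is not the FLIP's implementation; the class name Utf8CastSketch and the helpers decodeLenient and decodeStrict are hypothetical names chosen for illustration. The (byte[], int, int) shape simply mirrors the overload suggested above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Flink or the FLIP.
class Utf8CastSketch {

    // Lenient decoding: the JDK String constructor replaces malformed
    // input with U+FFFD instead of failing.
    static String decodeLenient(byte[] bytes, int offset, int length) {
        return new String(bytes, offset, length, StandardCharsets.UTF_8);
    }

    // Strict decoding: a decoder configured with REPORT throws a
    // CharacterCodingException on malformed input.
    static String decodeStrict(byte[] bytes, int offset, int length)
            throws CharacterCodingException {
        return StandardCharsets.UTF_8
                .newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes, offset, length))
                .toString();
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {0x61, (byte) 0xFF, 0x62};

        // BYTES -> STRING -> BYTES with lenient decoding changes the bytes.
        String lenient = decodeLenient(invalid, 0, invalid.length);
        byte[] roundTrip = lenient.getBytes(StandardCharsets.UTF_8);
        System.out.println("round trip preserved bytes: "
                + Arrays.equals(invalid, roundTrip));

        // Strict decoding surfaces the problem instead of hiding it.
        try {
            decodeStrict(invalid, 0, invalid.length);
            System.out.println("strict decode succeeded");
        } catch (CharacterCodingException e) {
            System.out.println("strict decode rejected input: " + e);
        }
    }
}
```

With lenient decoding the single byte 0xFF comes back as the three-byte encoding of U+FFFD, so the original bytes cannot be recovered; the strict variant fails at the point of decoding instead.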
