Hi, Gustavo. 
Updating the @Internal API independently sounds good to me. +1.




--

    Best!
    Xuyang



At 2026-03-23 20:50:02, "Gustavo de Morais" <[email protected]> wrote:
>Hi Sergey,
>
>Thanks for the message! I took a look at our support for UTF-16 and I don't
>think any additional UTF-16 support is necessary for the scope of this
>FLIP. For UTF-16, users should use DECODE instead of CAST. Flink stores all
>strings as UTF-8 internally, so DECODE(bytes, 'UTF-16') already handles
>that case - it converts to a Java String and re-encodes as UTF-8 before
>storage. Since the internal representation is always UTF-8, the validation
>problem is fundamentally a UTF-8 concern and the functions are correctly
>named. Adding validation function support for all the character sets we
>support for DECODE (‘US-ASCII’, ‘ISO-8859-1’, ‘UTF-8’, ‘UTF-16BE’,
>‘UTF-16LE’, ‘UTF-16’) might be useful for some less frequent use
>cases, but I'd say this is out of scope for this FLIP.
>
>Kind regards,
>Gustavo
>
>
>On Fri, 20 Mar 2026 at 20:24, Sergey Nuyanzin <[email protected]> wrote:
>
>> Hi Gustavo
>> thanks for the proposal
>>
>> I noticed that you are proposing the use of UTF8 in the names (the default
>> cast to string also uses UTF-8).
>> However, I wonder if it makes sense to introduce similar UTF-16
>> functions, as Flink supports this as well?
>>
>> On Fri, Mar 20, 2026 at 8:09 PM Gustavo de Morais
>> <[email protected]> wrote:
>> >
>> > Hi Xuyang and Timo,
>> >
>> > Thanks for the positive feedback! Regarding your suggestions, Xuyang:
>> >
>> > 1. Yes, good point - we should add the fromUtf8Bytes(byte[], int, int)
>> > overload as well.
>> > 2. This is also relevant. If we want to do validation during ingestion
>> > time, this might have performance implications. Since these are @Internal
>> > APIs, they can be changed independently from the FLIP afterwards if it
>> > makes sense. What's your opinion?
>> >
>> > Kind regards,
>> > Gustavo
>> >
>> > On Fri, 20 Mar 2026 at 08:44, Xuyang <[email protected]> wrote:
>> >
>> > > Hi, Gustavo.
>> > > Great catch! Thanks for driving this FLIP. Overall LGTM. I just have two
>> > > minor points I'd like to confirm with you.
>> > > 1. Should we also add the overload function `fromUtf8Bytes(byte[], int,
>> > > int)` in StringData?
>> > > 2. Callers like `ColumnarRowData#getString` and
>> > > `ColumnarArrayData#getString` call `StringData.fromBytes` directly. Should
>> > > these call sites be migrated in a follow-up, or intentionally left as-is?
>> > >
>> > >
>> > >
>> > >
>> > >
>> > > --
>> > >
>> > >     Best!
>> > >     Xuyang
>> > >
>> > >
>> > >
>> > > At 2026-03-19 22:37:28, "Timo Walther" <[email protected]> wrote:
>> > > >Hi Gustavo,
>> > > >
>> > > >thank you for this excellent design document. And thanks for discovering
>> > > >this data loss and driving the investigation. We should definitely fix
>> > > >this shortcoming. Also looking at other vendors, it is definitely a cause
>> > > >for false assumptions that lead to hard-to-debug inconsistencies.
>> > > >
>> > > >+1 for this proposal.
>> > > >
>> > > >Cheers,
>> > > >Timo
>> > > >
>> > > >
>> > > >On 19.03.26 15:23, Gustavo de Morais wrote:
>> > > >> Hi everyone,
>> > > >>
>> > > >> Currently, CAST(bytes AS STRING) silently replaces any invalid UTF-8 byte
>> > > >> with U+FFFD (?). The substitution is irreversible and produces no warning -
>> > > >> the pipeline keeps running while data is permanently corrupted
>> > > >> downstream. This also means that a CAST from BYTES → STRING → BYTES is not
>> > > >> idempotent, which prevents the engine from applying certain optimizations,
>> > > >> for example preserving upsert keys after such CASTs.
>> > > >>
>> > > >> I'd like to start a discussion around defining and improving the default
>> > > >> behavior. I've written a short FLIP [1] proposing new utility functions to
>> > > >> handle this explicitly - similar to what other engines like Spark already
>> > > >> do - and changing the default behavior to throw an error instead of
>> > > >> silently corrupting data, while giving users clear options to deal with
>> > > >> invalid bytes.
>> > > >>
>> > > >> Looking forward to your feedback and thoughts.
>> > > >>
>> > > >> Kind regards,
>> > > >> Gustavo
>> > > >>
>> > > >> [1] https://cwiki.apache.org/confluence/display/FLINK/FLIP-568%3A+Strict+BYTES-to-STRING+CAST+with+UTF-8+Validation+Utilities
>> > > >>
>> > >
>>
>>
>>
>> --
>> Best regards,
>> Sergey
>>
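
For readers following the thread, the behaviour discussed above can be reproduced with plain JDK classes. This is only a sketch under stated assumptions: it does not use Flink's StringData or the functions proposed in the FLIP, and the class and method names (Utf8CastSketch, lenientDecode, strictDecode, isValidUtf8, decodeUtf16) are hypothetical. It shows that a lenient decode replaces an invalid byte with U+FFFD so that BYTES → STRING → BYTES no longer returns the original bytes, that a strict decoder can report the problem instead, and that UTF-16 input can be decoded to a Java String and re-encoded as UTF-8.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration class; not part of Flink or of the FLIP.
class Utf8CastSketch {

    // Lenient decode: the String constructor replaces malformed input with U+FFFD.
    static String lenientDecode(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Strict decode: a decoder configured with REPORT throws on malformed input.
    static String strictDecode(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    // Validation helper built on the strict decoder.
    static boolean isValidUtf8(byte[] bytes) {
        try {
            strictDecode(bytes);
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // UTF-16 input: decode to a Java String; the caller can re-encode it as UTF-8.
    static String decodeUtf16(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        // 0xFF can never appear in well-formed UTF-8.
        byte[] invalid = {'a', (byte) 0xFF, 'b'};

        String lenient = lenientDecode(invalid);
        byte[] roundTrip = lenient.getBytes(StandardCharsets.UTF_8);

        System.out.println(lenient.indexOf('\uFFFD') >= 0);    // true: byte was replaced
        System.out.println(Arrays.equals(invalid, roundTrip)); // false: original bytes are lost
        System.out.println(isValidUtf8(invalid));              // false: strict decoding rejects it

        // UTF-16 bytes decoded to a String, then stored as UTF-8.
        byte[] utf16 = "h\u00e9llo".getBytes(StandardCharsets.UTF_16);
        byte[] asUtf8 = decodeUtf16(utf16).getBytes(StandardCharsets.UTF_8);
        System.out.println(isValidUtf8(asUtf8));               // true
    }
}
```

The strict path corresponds in spirit to the "throw an error instead of silently corrupting data" default discussed in the thread; how Flink would actually implement it is left to the FLIP.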
