Re: [E] Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Hyeonho Kim Thu, 12 Mar 2026 00:07:55 -0700

Thanks, everyone, for the discussion.

My understanding from the discussion so far is that the policy direction
should be:


   - For string-containing sketches, interoperability pitfalls related to
   UTF-8 encoding should be documented clearly.
   - Optional helper tools for common cases may be useful, but they do not
   seem essential to the policy itself.

Given that, I think the most practical next step is documentation.

As far as I know, this is not documented clearly today for C++ and Go, so I
will follow up by proposing documentation updates there as a next step.
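
To make the pitfall concrete, here is a minimal Go sketch of the kind of
example the documentation could include. It does not use any DataSketches
API: checkUTF8 and ErrInvalidUTF8 are hypothetical names made up for this
illustration, and the only library call is utf8.ValidString from Go's
standard unicode/utf8 package. It shows that a Go string can carry bytes
that are not valid UTF-8, and that validation alone cannot catch every
mismatch, since two valid UTF-8 spellings of the same visible text can
still differ byte for byte.

```go
package main

import (
	"errors"
	"fmt"
	"unicode/utf8"
)

// ErrInvalidUTF8 is a hypothetical error value used only in this example.
var ErrInvalidUTF8 = errors.New("string is not valid UTF-8")

// checkUTF8 is a hypothetical helper that a caller could run on each
// string before passing it to a sketch's update method.
func checkUTF8(s string) error {
	if !utf8.ValidString(s) {
		return ErrInvalidUTF8
	}
	return nil
}

func main() {
	// "café" as UTF-8 bytes: é is the two bytes 0xC3 0xA9.
	asUTF8 := "caf\xc3\xa9"
	// "café" as ISO-8859-1 (Latin-1) bytes: é is the single byte 0xE9.
	// Go accepts this as a string, but it is not valid UTF-8.
	asLatin1 := "caf\xe9"

	fmt.Println(checkUTF8(asUTF8) == nil)   // true
	fmt.Println(checkUTF8(asLatin1) == nil) // false

	// Both of these are valid UTF-8 and render as "café", but the first
	// uses the precomposed code point U+00E9 and the second uses "e"
	// followed by the combining accent U+0301. Validation accepts both,
	// yet a sketch would treat them as two different items.
	precomposed := "caf\u00e9"
	decomposed := "cafe\u0301"

	fmt.Println(checkUTF8(precomposed) == nil) // true
	fmt.Println(checkUTF8(decomposed) == nil)  // true
	fmt.Println(precomposed == decomposed)     // false
}
```

A check like this only covers the common case of non-UTF-8 bytes reaching
a sketch. Agreeing on one encoding and one normalization form for all
producers stays the caller's responsibility.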

On Sun, Mar 8, 2026 at 2:45 PM Lee Rhodes <[email protected]> wrote:

> This has been a helpful discussion.  My thinking about this has also
> changed, but for a different reason.
>
> My proposal to have an encoding standard for strings came from a (noble?)
> desire to help protect our users from footguns.
>
> However, ensuring compatibility between any two sketches that have been
> independently loaded is a much deeper can-of-worms than we have discussed
> here:
>
>    - Imagine a merge of two sketches inadvertently fed strings using
>    different character encodings. It doesn't matter if the sketches originated
>    from different programming languages or not.
>    - Converting a string to a hash doesn't change this.  This means
>    virtually all of our sketches could be vulnerable to this user mistake and
>    not just our container sketches.
>    - Natural numeric instability of doubles could also create similar
>    silent failures if the user is not careful.
>
> I don't think that there is any way we can programmatically protect our
> users from all of these possible mistakes.
>
> Having said that, providing some useful tools that could help the user
> validate UTF-8 strings might be useful. It won't protect against all of the
> potential user mistakes of this type, just perhaps some common ones.
>
> But if we decide not to do anything programmatic, we could at least
> provide sufficient warnings in the documentation of these possible, and
> easy to make pitfalls.  We don't have to do this right away, but as the
> various libraries move to new versions, this kind of documentation should
> be on the list to add.
>
>
>
>
>
> On Sat, Mar 7, 2026 at 2:57 AM Hyeonho Kim <[email protected]> wrote:
>
>> Thanks.
>>
>> After thinking more about it and reviewing the C++ and Go code more
>> closely, my view has changed.
>>
>> I now think that changing the serialization format just to preserve UTF-8
>> validation behavior for C++ and Go would be too heavy. If we do not change
>> the serialization format, then we cannot fully preserve behavioral
>> consistency across serialization/deserialization anyway.
>>
>> At the same time, I do not think we should ignore language-independent
>> sketch images for string-containing sketches.
>> So my current view is that we should keep the sketch format unchanged and
>> leave `update()` behavior unchanged.
>>
>> If possible, we provide an explicit portability path through UTF-8
>> validating SerDe choices.
>> If that is not desirable, then at minimum I think we should document this
>> point clearly. In particular, I think we should document clearly that
>> cross-language portability for string-containing sketches depends on using
>> valid UTF-8.
>>
>>
>> On Sat, Mar 7, 2026 at 4:47 PM Alexander Saydakov via dev <
>> [email protected]> wrote:
>>
>>> I would reiterate that in my view sketches should not care about
>>> validation.
>>> If the user desires validation, he can instantiate, say,
>>> frequent_items_sketch<utf8_string> instead of
>>> frequent_items_sketch<std::string>.
>>> utf8_string should perform validation.
>>>
>>> On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I realized there is one more design point that may need discussion.
>>>>
>>>> For sketches that validate UTF-8 at update() time by default, with an
>>>> explicit opt-out, that setting affects the behavior of future update()
>>>> calls even after deserialization.
>>>>
>>>> So there seems to be a broader design choice here for string-specific
>>>> sketches / update APIs:
>>>>
>>>>    1. Treat the UTF-8 validation setting as part of the serialized sketch
>>>>    state, so it is preserved across serialization/deserialization.
>>>>    2. Treat it as a runtime policy only, in which case it would need to
>>>>    be specified again after deserialization (or when constructing a new
>>>>    sketch).
>>>>
>>>> The first option would preserve behavioral consistency, so it seems
>>>> like the more semantically consistent choice. However, it also seems like a
>>>> much bigger decision in practice, since it would require a serialization
>>>> format change / versioning.
>>>>
>>>> The second option avoids changing the serialized format, but a
>>>> deserialized sketch may not behave exactly the same for future update()
>>>> calls unless the caller explicitly restores the same policy.
>>>>
>>>> What do others think?
>>>>
>>>> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>>>>
>>>>> I agree. Here is a proposed wording that is a sort of a "policy" way
>>>>> to think about this:
>>>>>
>>>>> For "container" type sketches that can potentially retain Strings:
>>>>>
>>>>>    - If a sketch has the word "string" as part of its name, then
>>>>>    UTF-8 validation at update() should be the default with an explicit
>>>>>    opt-out.  Example: ArrayOfStringsTupleSketch.
>>>>>    - If an update method to a sketch has an explicit "string"
>>>>>    parameter, then UTF-8 validation should be the default with an explicit
>>>>>    opt-out.  Example: FdtSketch::update(String[]).
>>>>>    - Otherwise, if a sketch or update method accepts just a generic
>>>>>    type T, then we will provide a UTF-8 validating "SerDe" object that
>>>>>    can be optionally used for type T.
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> Unless there are objections, I propose the following:
>>>>>>
>>>>>>    1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>>>>>    (validation OFF by default).
>>>>>>    2. For AoS string items, enable UTF-8 validation at update() by
>>>>>>    default, with an explicit opt-out.
>>>>>>
>>>>>> If this direction looks reasonable, I will proceed accordingly in the
>>>>>> AoS PR and follow up with a separate PR for the SerDe option.
>>>>>>
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Hyeonho
>>>>>>
>>>>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Thanks all for the feedback.
>>>>>>>
>>>>>>>
>>>>>>> We can preserve backward compatibility for existing C++ users while
>>>>>>> also providing a clear path for cross-language portability.
>>>>>>>
>>>>>>> What do you think about the following approach?
>>>>>>>
>>>>>>> - SerDe with string: Add an option to validate whether the string
>>>>>>> contains valid UTF-8 sequences. The default would be validation OFF to
>>>>>>> preserve existing compatibility.
>>>>>>>
>>>>>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast),
>>>>>>> with validation enabled by default and an explicit opt-out for users
>>>>>>> who want one.
>>>>>>>
>>>>>>>
>>>>>>> For DS-Go, we can follow the same policy as C++.
>>>>>>>
>>>>>>>
>>>>>>> Feedback is welcome.
>>>>>>>
>>>>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>>>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>>>>>
>>>>>>>> That wouldn’t just be an API-breaking change. It would break
>>>>>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>>>>>> portability.
>>>>>>>>
>>>>>>>> A separate utf8_serde option gets my vote.
>>>>>>>>
>>>>>>>>   jon
>>>>>>>>
>>>>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Regarding C++, I would think that the easiest approach is to
>>>>>>>>> instruct the user to use a UTF8-validating string substitute instead
>>>>>>>>> of std::string.
>>>>>>>>> I am not sure whether we should provide such a thing or let the
>>>>>>>>> user come up with their own implementation.
>>>>>>>>> Consider having a utf8_string that would validate the input in the
>>>>>>>>> constructor but is otherwise identical to std::string.
>>>>>>>>> So the user can instantiate, for example,
>>>>>>>>> frequent_items_sketch<utf8_string> instead of
>>>>>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>>>>>> retain and serialize strings, we should validate that string
>>>>>>>>>> payloads are valid UTF-8 sequences to preserve cross-language
>>>>>>>>>> portability.
>>>>>>>>>>
>>>>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>>>>>> time) is attractive because it is fail-fast, but it also adds
>>>>>>>>>> cost on the hot path. If the community is comfortable with that
>>>>>>>>>> overhead for string-based container sketches, I’m happy to pursue
>>>>>>>>>> the update()-time validation approach.
>>>>>>>>>>
>>>>>>>>>> If performance sensitivity is a concern, an alternative would be
>>>>>>>>>> to always validate at (de)serialization boundaries (to guarantee
>>>>>>>>>> artifact correctness), and optionally provide a “fail-fast” mode
>>>>>>>>>> that enables validation at update() as well.
>>>>>>>>>>
>>>>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>>>>>> dependency for the validator.
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that
>>>>>>>>>>> allow a user to update the sketch with a string and that also
>>>>>>>>>>> retain a sample of the input strings seen. When serialized,
>>>>>>>>>>> there is an implicit assumption that another user, possibly in a
>>>>>>>>>>> different language, can successfully deserialize those sketch
>>>>>>>>>>> images. These sketches include KLL, REQ, Classic Quantiles,
>>>>>>>>>>> Sampling, FrequentItems, and Tuple. We informally call these
>>>>>>>>>>> "container" sketches, because they contain actual samples from
>>>>>>>>>>> the input stream.  HLL, Theta, CPC, BloomFilter, etc., are not
>>>>>>>>>>> container sketches.
>>>>>>>>>>>
>>>>>>>>>>> In the DS-Java library, all container sketches that allow
>>>>>>>>>>> strings always use UTF_8. So the sketch images produced will
>>>>>>>>>>> contain proper UTF_8 sequences.
>>>>>>>>>>>
>>>>>>>>>>> In the DS-CPP library, all the various data types are abstracted
>>>>>>>>>>> via templates. The serialization operation is declared similar to
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*, where
>>>>>>>>>>> *T* is the item type, *os* is the output stream, and *sd* is the
>>>>>>>>>>> SerDe that performs the conversion to bytes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> If the user wants to use an item of type string, *T* would
>>>>>>>>>>> typically be of type *std::string*, which is just a blob of
>>>>>>>>>>> bytes with no requirement that it is UTF_8.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> So far, we have trusted users of the library to know that if
>>>>>>>>>>> they update one of these container classes with a type *T*,
>>>>>>>>>>> the downstream user can successfully decode it. But this could
>>>>>>>>>>> be catastrophic:  A downstream user of a sketch image could be
>>>>>>>>>>> separated from the creation of the sketch image by years and be
>>>>>>>>>>> using a different language.
>>>>>>>>>>>
>>>>>>>>>>> One of the big advantages of our DataSketches project is that
>>>>>>>>>>> our serialization images should be language and platform
>>>>>>>>>>> independent, allowing cross-language and cross-platform
>>>>>>>>>>> interchange of sketches.
>>>>>>>>>>>
>>>>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>>>>>
>>>>>>>>>>> So how do we implement that?  My thoughts are as follows:
>>>>>>>>>>>
>>>>>>>>>>>    1. We should document now in the website and in appropriate
>>>>>>>>>>>    places in the library the potential danger of not using UTF_8
>>>>>>>>>>>    strings. (At least until we have a more robust solution.)
>>>>>>>>>>>    2. I think implementing validation checks on UTF_8 strings
>>>>>>>>>>>    at the SerDe boundaries may be too late.  A user could have
>>>>>>>>>>>    processed a large stream of data only to discover a failure at
>>>>>>>>>>>    serialization time, which could be much later in time.  The other
>>>>>>>>>>>    possibility would be to validate the strings at the input into the
>>>>>>>>>>>    sketch, typically in the *update()* method.
>>>>>>>>>>>    3. For C++, there are 3rd party libraries that specialize in
>>>>>>>>>>>    UTF_8 validation, including ICU
>>>>>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>>>>>    <https://github.com/nemtrif/utfcpp> and simdjson
>>>>>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>>>>>>>    small section of code.
>>>>>>>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>>>>
>>>>>>>>>>> I welcome your feedback.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]>
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>>>>>> deserializes String values.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>>>>
>>>>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding.
>>>>>>>>>>>> And we check the encoding on deserialization.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not
>>>>>>>>>>>> use UTF-8, such as BStr.
>>>>>>>>>>>>
>>>>>>>>>>>> So, my opinions are:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be
>>>>>>>>>>>> able to choose a proper type to deserialize the bytes into a type
>>>>>>>>>>>> that doesn't require UTF-8 encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> tison.
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in
>>>>>>>>>>>>> C++ (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>>>>>> a broader design question came up that may affect multiple
>>>>>>>>>>>>> sketches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>>>>>> valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>>>>>> Java-generated artifacts already assume valid UTF-8 string
>>>>>>>>>>>>> encoding.
>>>>>>>>>>>>> - Rust and Python string types represent Unicode text and can
>>>>>>>>>>>>> be encoded to UTF-8. Please correct me if I am mistaken. (I don't
>>>>>>>>>>>>> know Rust and Python well.)
>>>>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid
>>>>>>>>>>>>> UTF-8 unless explicitly validated. So during serialization, it
>>>>>>>>>>>>> may produce invalid UTF-8 sequences.
>>>>>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>>>>>> enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>>>>>> corrections.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we want to maintain cross-language portability for
>>>>>>>>>>>>> serialized artifacts, one possible approach would be to ensure
>>>>>>>>>>>>> that any serialized string data is valid UTF-8. This could
>>>>>>>>>>>>> potentially apply to any sketches that serialize or deserialize
>>>>>>>>>>>>> string data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>>>>>> caller discipline
>>>>>>>>>>>>>
>>>>>>>>>>>>> At this point I am not proposing a specific solution. I would
>>>>>>>>>>>>> like to hear opinions from the community on whether we want to
>>>>>>>>>>>>> require serialized string data to be valid UTF-8 for
>>>>>>>>>>>>> cross-language portability.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hyeonho
>>>>>>>>>>>>>
>>>>>>>>>>>>
