Re: [E] Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Hyeonho Kim Sat, 07 Mar 2026 02:55:54 -0800

Thanks.

After thinking more about it and reviewing the C++ and Go code more
closely, my view has changed.


I now think that changing the serialization format just to preserve UTF-8
validation behavior for C++ and Go would be too heavy. If we do not change
the serialization format, then we cannot fully preserve behavioral
consistency across serialization/deserialization anyway.

At the same time, I do not think we should ignore language-independent
sketch images for string-containing sketches.
So my current view is that we should keep the sketch format unchanged and
leave `update()` behavior unchanged.

If possible, we should provide an explicit portability path through optional
UTF-8-validating SerDe implementations.
If that is not desirable, then at minimum I think we should document clearly
that cross-language portability for string-containing sketches depends on
using valid UTF-8.


On Sat, Mar 7, 2026 at 4:47 PM Alexander Saydakov via dev <
[email protected]> wrote:

> I would reiterate that in my view sketches should not care about
> validation.
> If the user desires validation, he can instantiate, say,
> frequent_items_sketch<utf8_string> instead of
> frequent_items_sketch<std::string>.
> utf8_string should perform validation.
>
> On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:
>
>> Hi all,
>>
>> I realized there is one more design point that may need discussion.
>>
>> For sketches that validate UTF-8 at update() time by default, with an
>> explicit opt-out, that setting affects the behavior of future update()
>> calls even after deserialization.
>>
>> So there seems to be a broader design choice here for string-specific
>> sketches / update APIs:
>>
>>    1. Treat the UTF-8 validation setting as part of the serialized sketch
>>    state, so it is preserved across serialization/deserialization.
>>    2. Treat it as a runtime policy only, in which case it would need to be
>>    specified again after deserialization (or when constructing a new sketch).
>>
>> The first option would preserve behavioral consistency, so it seems like
>> the more semantically consistent choice. However, it also seems like a much
>> bigger decision in practice, since it would require a serialization format
>> change / versioning.
>>
>> The second option avoids changing the serialized format, but a
>> deserialized sketch may not behave exactly the same for future update()
>> calls unless the caller explicitly restores the same policy.
>>
>> What do others think?
>>
>> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>>
>>> I agree. Here is a proposed wording that is a sort of a "policy" way to
>>> think about this:
>>>
>>> For "container" type sketches that can potentially retain Strings:
>>>
>>>    - If a sketch has the word "string" as part of its name, then UTF-8
>>>    validation at update() should be the default with an explicit
>>>    opt-out.  Example: ArrayOfStringsTupleSketch.
>>>    - If an update method to a sketch has an explicit "string"
>>>    parameter, then UTF-8 validation should be the default with an explicit
>>>    opt-out.  Example: FdtSketch::update(String[]).
>>>    - Otherwise, if a sketch or update method accepts just a generic
>>>    type T, then we will provide a UTF-8 validating "SerDe" object that can
>>>    optionally be used for type T.
>>>
>>>
>>>
>>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Hi all!
>>>>
>>>> Unless there are objections, I propose the following:
>>>>
>>>>    1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>>>    (validation OFF by default).
>>>>    2. For AoS string items, enable UTF-8 validation at update() by
>>>>    default, with an explicit opt-out.
>>>>
>>>> If this direction looks reasonable, I will proceed accordingly in the
>>>> AoS PR and follow up with a separate PR for the SerDe option.
>>>>
>>>>
>>>> Thanks,
>>>>
>>>> Hyeonho
>>>>
>>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks all for the feedback.
>>>>>
>>>>>
>>>>> We can preserve backward compatibility for existing C++ users while
>>>>> also providing a clear path for cross-language portability.
>>>>>
>>>>> What do you think about the following approach?
>>>>>
>>>>> - SerDe with string: Add an option to validate whether the string
>>>>> contains valid UTF-8 sequences. The default would be validation OFF to
>>>>> preserve existing compatibility.
>>>>>
>>>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
>>>>> Validation would be enabled by default, with an explicit opt-out for
>>>>> users who want it.
>>>>>
>>>>>
>>>>> For DS-Go, we can follow the same policy as C++.
>>>>>
>>>>>
>>>>> Feedback is welcome.
>>>>>
>>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>>>
>>>>>> That wouldn’t just be an API-breaking change. It would break
>>>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>>>> portability.
>>>>>>
>>>>>> A separate utf8_serde option gets my vote.
>>>>>>
>>>>>>   jon
>>>>>>
>>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <
>>>>>> [email protected]> wrote:
>>>>>>
>>>>>>> Regarding C++, I would think that the easiest approach is to
>>>>>>> instruct the user to use a UTF8-validating string substitute instead of
>>>>>>> std::string.
>>>>>>> I am not sure whether we should provide such a thing or let users
>>>>>>> come up with their own implementation.
>>>>>>> Consider having a utf8_string that would validate the input in the
>>>>>>> constructor but is otherwise identical to std::string.
>>>>>>> So the user can instantiate, for example,
>>>>>>> frequent_items_sketch<utf8_string> instead of
>>>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>>>> retain and serialize strings, we should validate that string payloads
>>>>>>>> are valid UTF-8 sequences to preserve cross-language portability.
>>>>>>>>
>>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>>>> time) is attractive because it is fail-fast, but it also adds cost
>>>>>>>> on the hot path. If the community is comfortable with that overhead
>>>>>>>> for string-based container sketches, I’m happy to pursue the
>>>>>>>> update()-time validation approach.
>>>>>>>>
>>>>>>>> If performance sensitivity is a concern, an alternative would be to
>>>>>>>> always validate at (de)serialization boundaries (to guarantee artifact
>>>>>>>> correctness), and optionally provide a “fail-fast” mode that enables
>>>>>>>> validation at update() as well.
>>>>>>>>
>>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>>>> dependency for the validator.
>>>>>>>>
>>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow
>>>>>>>>> a user to update the sketch with a string and the sketch also retains
>>>>>>>>> within the sketch a sample of the input strings seen. When serialized,
>>>>>>>>> there is an implicit assumption that another user, possibly in a
>>>>>>>>> different language, can successfully deserialize those sketch images.
>>>>>>>>> These sketches include KLL, REQ, Classic Quantiles, Sampling,
>>>>>>>>> FrequentItems, and Tuple. We informally call these "container"
>>>>>>>>> sketches, because they contain actual samples from the input stream.
>>>>>>>>> HLL, Theta, CPC, BloomFilter, etc., are not container sketches.
>>>>>>>>>
>>>>>>>>> In the DS-Java library, all container sketches that allow strings
>>>>>>>>> always use UTF_8. So the sketch images produced will contain proper
>>>>>>>>> UTF_8 sequences.
>>>>>>>>>
>>>>>>>>> In the DS-CPP library, all the various data types are abstracted
>>>>>>>>> via templates. The serialization operation is declared similar to
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>    sketch<T>::serialize(std::ostream& os, const SerDe& sd)
>>>>>>>>>
>>>>>>>>> where T is the item type, os is the output stream, and sd is the
>>>>>>>>> SerDe that performs the conversion to bytes.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> If the user wants to use an item of type string, *T* would
>>>>>>>>> typically be of type *std::string*, which is just a blob of bytes
>>>>>>>>> with no requirement that it be UTF_8.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So far, we have trusted users of the library to know that if they
>>>>>>>>> update one of these container classes with a type *T*, the
>>>>>>>>> downstream user can successfully decode it. But getting this wrong
>>>>>>>>> could be catastrophic: a downstream user of a sketch image could be
>>>>>>>>> separated from the creation of the sketch image by years and be using
>>>>>>>>> a different language.
>>>>>>>>>
>>>>>>>>> One of the big advantages of our DataSketches project is that our
>>>>>>>>> serialization images should be language and platform independent,
>>>>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>>>>
>>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>>>
>>>>>>>>> So how do we implement that?  My thoughts are as follows:
>>>>>>>>>
>>>>>>>>>    1. We should document now, on the website and in appropriate
>>>>>>>>>    places in the library, the potential danger of not using UTF_8
>>>>>>>>>    strings (at least until we have a more robust solution).
>>>>>>>>>    2. I think implementing validation checks on UTF_8 strings at
>>>>>>>>>    the SerDe boundaries may be too late.  A user could have processed
>>>>>>>>>    a large stream of data only to discover a failure at serialization
>>>>>>>>>    time, which could be much later.  The other possibility would be
>>>>>>>>>    to validate the strings at the input into the sketch, typically in
>>>>>>>>>    the update() method.
>>>>>>>>>    3. For C++, there are 3rd party libraries that specialize in
>>>>>>>>>    UTF_8 validation, including ICU
>>>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>>>    <https://github.com/nemtrif/utfcpp> and simdjson
>>>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>>>>>    small section of code.
>>>>>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>>
>>>>>>>>> I welcome your feedback.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]>
>>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>>>> deserializes String values.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>>
>>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding.
>>>>>>>>>> And we check the encoding on deserialization.
>>>>>>>>>>
>>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not
>>>>>>>>>> use UTF-8, such as BStr.
>>>>>>>>>>
>>>>>>>>>> So, my opinions are:
>>>>>>>>>>
>>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able
>>>>>>>>>> to choose a proper type to deserialize the bytes into a type that
>>>>>>>>>> doesn't require UTF-8 encoding.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> tison.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in
>>>>>>>>>>> C++ (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>>>> a broader design question came up that may affect multiple sketches.
>>>>>>>>>>>
>>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>>
>>>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>>>> valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>>>> Java-generated artifacts already assume valid UTF-8 string encoding.
>>>>>>>>>>> - Rust and Python string types represent Unicode text and can be
>>>>>>>>>>> encoded to UTF-8. Please correct me if I am mistaken. (I don't know
>>>>>>>>>>> Rust and Python well.)
>>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>>>>>>> unless explicitly validated. So during serialization, it may produce
>>>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>>>> enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>>>
>>>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>>>> corrections.
>>>>>>>>>>>
>>>>>>>>>>> If we want to maintain cross-language portability for serialized
>>>>>>>>>>> artifacts, one possible approach would be to ensure that any
>>>>>>>>>>> serialized string data is valid UTF-8. This could potentially apply
>>>>>>>>>>> to any sketches that serialize or deserialize string data.
>>>>>>>>>>>
>>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>>>> caller discipline
>>>>>>>>>>>
>>>>>>>>>>> At this point I am not proposing a specific solution. I would
>>>>>>>>>>> like to hear opinions from the community on whether we want to
>>>>>>>>>>> require serialized string data to be valid UTF-8 for cross-language
>>>>>>>>>>> portability.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Hyeonho
>>>>>>>>>>>
>>>>>>>>>>
