Re: [DISCUSS] UTF-8 validation for string SerDe across sketches

Hyeonho Kim Sun, 15 Feb 2026 20:38:03 -0800

Thanks for the feedback. I agree that for container sketches that retain
and serialize strings, we should validate that string payloads are valid
UTF-8 sequences to preserve cross-language portability.


On *where* to validate in DS-CPP: validating at update() (ingest time) is
attractive because it is fail-fast, but it also adds cost on the hot path.
If the community is comfortable with that overhead for string-based
container sketches, I’m happy to pursue the update()-time validation
approach.

If performance sensitivity is a concern, an alternative would be to always
validate at (de)serialization boundaries (to guarantee artifact
correctness), and optionally provide a “fail-fast” mode that enables
validation at update() as well.

For DS-Go, we can follow the same policy. Go’s situation is a bit simpler
in implementation because it provides UTF-8 validation in the standard
library (unicode/utf8), so we wouldn’t need an external dependency for the
validator.
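
To make the policy concrete, here is a minimal Go sketch of it, not DS-Go
code. StringContainer, NewStringContainer, ErrInvalidUTF8, ErrTruncated and
the length-prefixed byte layout are all hypothetical names and choices made
up for this illustration; only utf8.ValidString and utf8.Valid are real
standard-library calls. Validation always runs at the (de)serialization
boundary, and the failFast flag additionally enables it in Update():

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"unicode/utf8"
)

// Hypothetical error values for this illustration.
var (
	ErrInvalidUTF8 = errors.New("string item is not valid UTF-8")
	ErrTruncated   = errors.New("serialized image is truncated")
)

// StringContainer is a hypothetical stand-in for a container sketch that
// retains string items. It is not a DS-Go type.
type StringContainer struct {
	items    []string
	failFast bool // when true, Update validates as well
}

func NewStringContainer(failFast bool) *StringContainer {
	return &StringContainer{failFast: failFast}
}

// Update validates only in fail-fast mode, keeping the default hot path cheap.
func (c *StringContainer) Update(item string) error {
	if c.failFast && !utf8.ValidString(item) {
		return ErrInvalidUTF8
	}
	c.items = append(c.items, item)
	return nil
}

// Serialize always validates, so an emitted image never contains invalid
// UTF-8. The layout (4-byte little-endian length, then bytes) is illustrative.
func (c *StringContainer) Serialize() ([]byte, error) {
	var out []byte
	var lenBuf [4]byte
	for _, s := range c.items {
		if !utf8.ValidString(s) {
			return nil, ErrInvalidUTF8
		}
		binary.LittleEndian.PutUint32(lenBuf[:], uint32(len(s)))
		out = append(out, lenBuf[:]...)
		out = append(out, s...)
	}
	return out, nil
}

// Deserialize always validates incoming string payloads.
func Deserialize(data []byte) (*StringContainer, error) {
	c := NewStringContainer(false)
	for len(data) > 0 {
		if len(data) < 4 {
			return nil, ErrTruncated
		}
		n := binary.LittleEndian.Uint32(data)
		data = data[4:]
		if uint64(n) > uint64(len(data)) {
			return nil, ErrTruncated
		}
		if !utf8.Valid(data[:n]) {
			return nil, ErrInvalidUTF8
		}
		c.items = append(c.items, string(data[:n]))
		data = data[n:]
	}
	return c, nil
}

func main() {
	bad := "caf\xe9" // lone Latin-1 byte 0xE9, not a valid UTF-8 sequence

	lazy := NewStringContainer(false)
	_ = lazy.Update(bad) // accepted at ingest in the default mode
	_, err := lazy.Serialize()
	fmt.Println("default mode, Serialize:", err)

	strict := NewStringContainer(true)
	fmt.Println("fail-fast mode, Update:", strict.Update(bad))
}
```

In the default mode the bad item is only caught when Serialize() runs, which
is the late-failure concern raised for SerDe-only validation; the fail-fast
mode trades per-update cost for rejecting it immediately.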

On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:

> This issue, raised by Hyeonho Kim, relates to sketches that a user can
> update with a string and that also retain a sample of the input strings
> seen. When these sketches are serialized, there is an implicit assumption
> that another user, possibly in a different language, can successfully
> deserialize those sketch images. These sketches include KLL,
> REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We informally
> call these "container" sketches, because they contain actual samples from
> the input stream.  HLL, Theta, CPC, BloomFilter, etc., are not container
> sketches.
>
> In the DS-Java library, all container sketches that allow strings always
> use UTF_8. So the sketch images produced will contain proper UTF_8
> sequences.
>
> In the DS-CPP library, all the various data types are abstracted via
> templates. The serialization operation is declared similar to
>
>
> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>
> where *T* is the item type, *os* is the output stream, and *sd* is the
> SerDe that performs the conversion to bytes.
>
>
> If the user wants to use an item of type string, *T* would typically be
> of type *std::string*, which is just a blob of bytes with no requirement
> that it be UTF_8.
>
>
> So far, we have trusted users of the library to know that if they update
> one of these container classes with a type *T*, the downstream user can
> successfully decode it. But a failure here could be catastrophic: a
> downstream user of a sketch image could be separated from the creation of
> the sketch image by years and be using a different language.
>
> One of the big advantages of our DataSketches project is that our
> serialization images should be language and platform independent, allowing
> cross-language and cross platform interchange of sketches.
>
> Hyeonho Kim's recommendation makes sense: For serialized sketch images
> that contain strings, those strings must be UTF_8.
>
> So how do we implement that?  My thoughts are as follows:
>
>    1. We should document now in the website and in appropriate places in
>    the library the potential danger of not using UTF_8 strings. (At least
>    until we have a more robust solution)
>    2. I think implementing validation checks on UTF_8 strings at the
>    SerDe boundaries may be too late.  A user could have processed a large
>    stream of data only to discover a failure at serialization time, which
>    could be much later in time.  The other possibility would be to validate
>    the strings at the input into the sketch, typically in the *update() *
>    method.
>    3. For C++, there are 3rd party libraries that specialize in UTF_8
>    validation, including ICU <https://github.com/unicode-org/icu>,
>    UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson
>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>    (These have standard licensing.) From what I've read, UTF-8 validation, if
>    done correctly, can be done very fast, with only a small section of code.
>    4. I am not sure what the solutions are for Rust or Go.
>
> I welcome your feedback.
>
>
> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>
>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>> deserializes String values.
>>
>> [1] https://github.com/apache/datasketches-rust/pull/82
>>
>> If it's std::string::String, then it must be of UTF-8 encoding. And we
>> check the encoding on deserialization.
>>
>> However, the Rust ecosystem also supports "strings" that do not use
>> UTF-8, such as BStr.
>>
>> So, my opinions are:
>>
>> 1. It's good to assume serialized string data to be valid UTF-8.
>> 2. Even if it isn't, for datasketches-rust, users should be able to
>> choose a proper type to deserialize the bytes into a type that doesn't
>> require UTF-8 encoding.
>>
>> Best,
>> tison.
>>
>>
>> Hyeonho Kim <[email protected]> wrote on Sat, Feb 14, 2026 at 17:24:
>>
>>> Hi all,
>>>
>>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
>>> https://github.com/apache/datasketches-cpp/pull/476), a broader design
>>> question came up that may affect multiple sketches.
>>>
>>> Based on my current understanding:
>>>
>>> - In datasketches-java, string serialization already produces valid
>>> UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So Java-generated
>>> artifacts already assume valid UTF-8 string encoding.
>>> - Rust and Python string types represent Unicode text and can be encoded
>>> to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python
>>> well)
>>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
>>> explicitly validated. So during serialization, it may produce invalid UTF-8
>>> sequences.
>>> - In C++, std::string is also a byte container and does not enforce
>>> UTF-8 validity. So during serialization, it may produce invalid UTF-8
>>> sequences.
>>>
>>> If I am mistaken on any of these points, I would appreciate corrections.
>>>
>>> If we want to maintain cross-language portability for serialized
>>> artifacts, one possible approach would be to ensure that any serialized
>>> string data is valid UTF-8. This could potentially apply to any sketches
>>> that serialize or deserialize string data.
>>>
>>> There seem to be several possible approaches:
>>> - Validate UTF-8 at serialization boundaries
>>> - Document that input strings must be valid UTF-8 and rely on caller
>>> discipline
>>>
>>> At this point I am not proposing a specific solution. I would like to
>>> hear opinions from the community on whether we want to require
>>> serialized string data to be valid UTF-8 for cross-language portability.
>>>
>>> Thanks,
>>>
>>> Hyeonho
>>>
>>
