Thanks for the feedback. I agree that for container sketches that retain and serialize strings, we should validate that string payloads are valid UTF-8 sequences to preserve cross-language portability.
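To make "validate" concrete, here is a minimal sketch in Go of what such a check could look like, using only the standard library's unicode/utf8 package. The helper name validateUTF8 is hypothetical and not part of any DataSketches library. The same function could be called at a serialization boundary, at update(), or both.

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// validateUTF8 is a hypothetical helper (not a DataSketches API).
// It returns nil when s is valid UTF-8, otherwise an error that names
// the byte offset of the first invalid sequence.
func validateUTF8(s string) error {
	// Fast path: one pass over the bytes, no decoding of individual runes.
	if utf8.ValidString(s) {
		return nil
	}
	// Slow path, taken only on failure: locate the offending byte.
	for i := 0; i < len(s); {
		r, size := utf8.DecodeRuneInString(s[i:])
		// (RuneError, 1) signals an invalid encoding; a legitimately
		// encoded U+FFFD decodes with size 3 and is accepted.
		if r == utf8.RuneError && size == 1 {
			return fmt.Errorf("invalid UTF-8 at byte offset %d", i)
		}
		i += size
	}
	return nil // not reached: ValidString already reported a failure
}

func main() {
	fmt.Println(validateUTF8("héllo"))    // <nil>
	fmt.Println(validateUTF8("ab\xffcd")) // invalid UTF-8 at byte offset 2
}
```

Reporting the offset is a design choice for debuggability; a caller that only needs a yes/no answer could use utf8.ValidString directly.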
On *where* to validate in DS-CPP: validating at update() (ingest time) is attractive because it is fail-fast, but it also adds cost on the hot path. If the community is comfortable with that overhead for string-based container sketches, I’m happy to pursue the update()-time validation approach. If performance is a concern, an alternative would be to always validate at (de)serialization boundaries (to guarantee artifact correctness), and optionally provide a “fail-fast” mode that enables validation at update() as well.

For DS-Go, we can follow the same policy. Go’s situation is a bit simpler to implement because it provides UTF-8 validation in the standard library (unicode/utf8), so we wouldn’t need an external dependency for the validator.

On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes wrote:

> This issue, raised by Hyeonho Kim, relates to sketches that allow a user
> to update the sketch with a string and the sketch also retains within
> the sketch a sample of the input strings seen. When serialized, there is an
> implicit assumption that another user, possibly in a different language,
> can successfully deserialize those sketch images. These sketches include KLL,
> REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We informally
> call these "container" sketches, because they contain actual samples from
> the input stream. HLL, Theta, CPC, BloomFilter, etc., are not container
> sketches.
>
> In the DS-Java library, all container sketches that allow strings always
> use UTF_8. So the sketch images produced will contain proper UTF_8
> sequences.
>
> In the DS-CPP library, all the various data types are abstracted via
> templates. The serialization operation is declared similar to
>
> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>
> where *T* is the item type, *os* is the output stream and *sd* is the
> SerDe that performs the conversion to bytes.
>
> If the user wants to use an item of type string, *T* would typically be
> of type *std::string*, which is just a blob of bytes and no requirement
> that it is UTF_8.
>
> So far, we have trusted users of the library to know that if they update
> one of these container classes with a type *T*, that the downstream user
> can successfully decode it. But this could be catastrophic: A downstream
> user of a sketch image could be separated from the creation of the sketch
> image by years and be using a different language.
>
> One of the big advantages of our DataSketches project is that our
> serialization images should be language and platform independent, allowing
> cross-language and cross platform interchange of sketches.
>
> Hyeonho Kim's recommendation makes sense: For serialized sketch images
> that contain strings, those strings must be UTF_8.
>
> So how do we implement that? My thoughts are as follows:
>
> 1. We should document now in the website and in appropriate places in
>    the library the potential danger of not using UTF_8 strings. (At least
>    until we have a more robust solution)
> 2. I think implementing validation checks on UTF_8 strings at the
>    SerDe boundaries may be too late. A user could have processed a large
>    stream of data only to discover a failure at serialization time, which
>    could be much later in time. The other possibility would be to validate
>    the strings at the input into the sketch, typically in the *update()*
>    method.
> 3. For C++, there are 3rd party libraries that specialize in UTF_8
>    validation, including ICU <https://github.com/unicode-org/icu>,
>    UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson
>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>    (These have standard licensing). From what I've read, UTF-8 validation, if
>    done correctly, can be done very fast, with only a small section of code.
> 4. I am not sure what the solutions are for Rust or Go.
>
> I welcome your feedback.
>
> On Sat, Feb 14, 2026 at 1:47 AM tison wrote:
>
>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>> deserializes String values.
>>
>> [1] https://github.com/apache/datasketches-rust/pull/82
>>
>> If it's std::string::String, then it must be of UTF-8 encoding. And we
>> check the encoding on deserialization.
>>
>> However, the Rust ecosystem also supports "strings" that do not use
>> UTF-8, such as BStr.
>>
>> So, my opinions are:
>>
>> 1. It's good to assume serialized string data to be valid UTF-8.
>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>    choose a proper type to deserialize the bytes into a type that doesn't
>>    require UTF-8 encoding.
>>
>> Best,
>> tison.
>>
>> On Sat, Feb 14, 2026 at 17:24, Hyeonho Kim wrote:
>>
>>> Hi all,
>>>
>>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref:
>>> https://github.com/apache/datasketches-cpp/pull/476), a broader design
>>> question came up that may affect multiple sketches.
>>>
>>> Based on my current understanding:
>>>
>>> - In datasketches-java, string serialization already produces valid
>>>   UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So Java-generated
>>>   artifacts already assume valid UTF-8 string encoding.
>>> - Rust and Python string types represent Unicode text and can be encoded
>>>   to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python
>>>   well)
>>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless
>>>   explicitly validated. So during serialization, it may produce invalid UTF-8
>>>   sequences.
>>> - In C++, std::string is also a byte container and does not enforce
>>>   UTF-8 validity. So during serialization, it may produce invalid UTF-8
>>>   sequences.
>>>
>>> If I am mistaken on any of these points, I would appreciate corrections.
>>>
>>> If we want to maintain cross-language portability for serialized
>>> artifacts, one possible approach would be to ensure that any serialized
>>> string data is valid UTF-8. This could potentially apply to any sketches
>>> that serialize or deserialize string data.
>>>
>>> There seem to be several possible approaches:
>>> - Validate UTF-8 at serialization boundaries
>>> - Document that input strings must be valid UTF-8 and rely on caller
>>>   discipline
>>>
>>> At this point I am not proposing a specific solution. I would like to
>>> hear opinions from the community on: We want to require serialized string
>>> data to be valid UTF-8 for cross-language portability
>>>
>>> Thanks,
>>>
>>> Hyeonho
>>>
>>
