Gonna agree with Alexander here. I think we should provide a serde option for C++, but we should not reject non-UTF-8 strings.
That wouldn’t just be an API-breaking change. It would break compatibility of C++ with itself for anyone who doesn’t need language portability.

A separate utf8_serde option gets my vote.

  jon

On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <[email protected]> wrote:

> Regarding C++, I would think that the easiest approach is to instruct the
> user to use a UTF-8-validating string substitute instead of std::string.
> I am not sure whether we should provide such a thing or let the user
> come up with their own implementation.
> Consider having a utf8_string that would validate the input in the
> constructor but is otherwise identical to std::string.
> So the user can instantiate, for example,
> frequent_items_sketch<utf8_string> instead of
> frequent_items_sketch<std::string> if validation is necessary.
>
>
> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:
>
>> Thanks for the feedback. I agree that for container sketches that retain
>> and serialize strings, we should validate that string payloads are valid
>> UTF-8 sequences to preserve cross-language portability.
>>
>> On *where* to validate in DS-CPP: validating at update() (ingest time)
>> is attractive because it is fail-fast, but it also adds cost on
>> the hot path. If the community is comfortable with that overhead for
>> string-based container sketches, I’m happy to pursue the update()-time
>> validation approach.
>>
>> If performance sensitivity is a concern, an alternative would be to
>> always validate at (de)serialization boundaries (to guarantee artifact
>> correctness), and optionally provide a “fail-fast” mode that enables
>> validation at update() as well.
>>
>> For DS-Go, we can follow the same policy. Go’s situation is a bit simpler
>> in implementation because it provides UTF-8 validation in the standard
>> library (unicode/utf8), so we wouldn’t need an external dependency for
>> the validator.
>>
>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>
>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a user
>>> to update the sketch with a string and that also retain within
>>> the sketch a sample of the input strings seen. When serialized, there is an
>>> implicit assumption that another user, possibly in a different language,
>>> can successfully deserialize those sketch images. These sketches include
>>> KLL, REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We
>>> informally call these "container" sketches, because they contain actual
>>> samples from the input stream. HLL, Theta, CPC, BloomFilter, etc., are not
>>> container sketches.
>>>
>>> In the DS-Java library, all container sketches that allow strings always
>>> use UTF-8. So the sketch images produced will contain proper UTF-8
>>> sequences.
>>>
>>> In the DS-CPP library, all the various data types are abstracted via
>>> templates. The serialization operation is declared similar to
>>>
>>>     sketch<T>::serialize(std::ostream& os, const SerDe& sd)
>>>
>>> where T is the item type, os is the output stream, and sd is the SerDe
>>> that performs the conversion to bytes.
>>>
>>> If the user wants to use an item of type string, T would typically be
>>> of type std::string, which is just a blob of bytes with no requirement
>>> that it is UTF-8.
>>>
>>> So far, we have trusted users of the library to know that if they update
>>> one of these container classes with a type T, the downstream
>>> user can successfully decode it. But this could be catastrophic: A
>>> downstream user of a sketch image could be separated from the creation of
>>> the sketch image by years and be using a different language.
>>>
>>> One of the big advantages of our DataSketches project is that our
>>> serialization images should be language and platform independent, allowing
>>> cross-language and cross-platform interchange of sketches.
>>>
>>> Hyeonho Kim's recommendation makes sense: For serialized sketch images
>>> that contain strings, those strings must be UTF-8.
>>>
>>> So how do we implement that? My thoughts are as follows:
>>>
>>>    1. We should document now on the website and in appropriate places
>>>    in the library the potential danger of not using UTF-8 strings. (At least
>>>    until we have a more robust solution)
>>>    2. I think implementing validation checks on UTF-8 strings at the
>>>    SerDe boundaries may be too late. A user could have processed a large
>>>    stream of data only to discover a failure at serialization time, which
>>>    could be much later in time. The other possibility would be to validate
>>>    the strings at the input into the sketch, typically in the update()
>>>    method.
>>>    3. For C++, there are 3rd party libraries that specialize in UTF-8
>>>    validation, including ICU <https://github.com/unicode-org/icu>,
>>>    UTF8-CPP <https://github.com/nemtrif/utfcpp>, and simdjson
>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>    (These have standard licensing). From what I've read, UTF-8 validation,
>>>    if done correctly, can be done very fast, with only a small section of code.
>>>    4. I am not sure what the solutions are for Rust or Go.
>>>
>>> I welcome your feedback.
>>>
>>>
>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>
>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>> deserializes String values.
>>>>
>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>
>>>> If it's std::string::String, then it must be of UTF-8 encoding. And we
>>>> check the encoding on deserialization.
>>>>
>>>> However, the Rust ecosystem also supports "strings" that do not use
>>>> UTF-8, such as BStr.
>>>>
>>>> So, my opinions are:
>>>>
>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>>> choose a proper type to deserialize the bytes into a type that doesn't
>>>> require UTF-8 encoding.
>>>>
>>>> Best,
>>>> tison.
>>>>
>>>>
>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>
>>>>> Hi all,
>>>>>
>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++
>>>>> (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>> a broader design question came up that may affect multiple sketches.
>>>>>
>>>>> Based on my current understanding:
>>>>>
>>>>>    - In datasketches-java, string serialization already produces valid
>>>>>    UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So Java-generated
>>>>>    artifacts already assume valid UTF-8 string encoding.
>>>>>    - Rust and Python string types represent Unicode text and can be
>>>>>    encoded to UTF-8. Please correct me if I am mistaken. (I don't know Rust
>>>>>    and Python well)
>>>>>    - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>    unless explicitly validated. So during serialization, it may produce
>>>>>    invalid UTF-8 sequences.
>>>>>    - In C++, std::string is also a byte container and does not enforce
>>>>>    UTF-8 validity. So during serialization, it may produce invalid UTF-8
>>>>>    sequences.
>>>>>
>>>>> If I am mistaken on any of these points, I would appreciate
>>>>> corrections.
>>>>>
>>>>> If we want to maintain cross-language portability for serialized
>>>>> artifacts, one possible approach would be to ensure that any serialized
>>>>> string data is valid UTF-8. This could potentially apply to any sketches
>>>>> that serialize or deserialize string data.
>>>>>
>>>>> There seem to be several possible approaches:
>>>>>    - Validate UTF-8 at serialization boundaries
>>>>>    - Document that input strings must be valid UTF-8 and rely on caller
>>>>>    discipline
>>>>>
>>>>> At this point I am not proposing a specific solution. I would like to
>>>>> hear opinions from the community on whether we want to require serialized
>>>>> string data to be valid UTF-8 for cross-language portability.
>>>>>
>>>>> Thanks,
>>>>>
>>>>> Hyeonho
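
For reference, the utf8_string idea raised in the thread could look roughly like the sketch below. It is only an illustration under stated assumptions: is_valid_utf8 and the str() accessor are hypothetical names, not part of datasketches-cpp, and the validator is a plain byte-by-byte check of the well-formed ranges from RFC 3629 rather than any of the third-party libraries mentioned in the thread. Using such a type with frequent_items_sketch would presumably also need a matching SerDe, which is not shown.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: returns true if s is well-formed UTF-8 (RFC 3629),
// i.e. no overlong encodings, no surrogates, nothing above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = static_cast<unsigned char>(s[i]);
    if (b0 <= 0x7F) { ++i; continue; }      // ASCII, single byte
    std::size_t len = 0;                    // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;     // allowed range for the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;            // rejects overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;            // rejects surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;            // rejects overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;            // rejects code points > U+10FFFF
    } else {
      return false;                         // 0x80..0xC1 or 0xF5..0xFF as lead
    }
    if (n - i < len) return false;          // truncated sequence
    const unsigned char b1 = static_cast<unsigned char>(s[i + 1]);
    if (b1 < lo || b1 > hi) return false;
    for (std::size_t k = 2; k < len; ++k) { // remaining continuation bytes
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if (c < 0x80 || c > 0xBF) return false;
    }
    i += len;
  }
  return true;
}

// Hypothetical wrapper: validates once in the constructor, then behaves as
// an immutable view over the underlying std::string.
class utf8_string {
public:
  utf8_string() = default;
  utf8_string(std::string s) : value_(std::move(s)) {
    if (!is_valid_utf8(value_)) {
      throw std::invalid_argument("utf8_string: input is not valid UTF-8");
    }
  }
  utf8_string(const char* s) : utf8_string(std::string(s)) {}

  const std::string& str() const noexcept { return value_; }

  bool operator==(const utf8_string& other) const { return value_ == other.value_; }
  bool operator!=(const utf8_string& other) const { return value_ != other.value_; }
  bool operator<(const utf8_string& other) const { return value_ < other.value_; }

private:
  std::string value_;
};

// Hashing delegates to std::hash<std::string> so the type can serve as a key.
namespace std {
template <>
struct hash<utf8_string> {
  size_t operator()(const utf8_string& s) const noexcept {
    return hash<string>()(s.str());
  }
};
}  // namespace std
```

With a type like this the validation cost is paid once per item at construction, so callers who do not need portability can keep using std::string unchanged, which matches the opt-in approach favored at the top of the thread.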
