I agree. Here is a proposed wording, framed as a "policy" for thinking about this:

For "container" type sketches that can potentially retain Strings:

- If a sketch has the word "string" as part of its name, then UTF-8
  validation at update() should be the default, with an explicit opt-out.
  Example: ArrayOfStringsTupleSketch.
- If an update method of a sketch has an explicit "string" parameter, then
  UTF-8 validation should be the default, with an explicit opt-out.
  Example: FdtSketch::update(String[]).
- Otherwise, if a sketch or update method accepts just a generic type T,
  then we will provide a UTF-8 validating "SerDe" object that can
  optionally be used for type T.

On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim wrote:

> Hi all!
>
> Unless there are objections, I propose the following:
>
> 1. Introduce an opt-in UTF-8 validating SerDe for std::string
>    (validation OFF by default).
> 2. For AoS string items, enable UTF-8 validation at update() by default,
>    with an explicit opt-out.
>
> If this direction looks reasonable, I will proceed accordingly in the AoS
> PR and follow up with a separate PR for the SerDe option.
>
> Thanks,
>
> Hyeonho
>
> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim wrote:
>
>> Thanks all for the feedback.
>>
>> We can preserve backward compatibility for existing C++ users while also
>> providing a clear path for cross-language portability.
>>
>> What do you think about the following approach?
>>
>> - SerDe with string: Add an option to validate whether the string
>>   contains valid UTF-8 sequences. The default would be validation OFF to
>>   preserve existing compatibility.
>>
>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
>>   Enable validation by default, with an explicit opt-out for users who
>>   want it.
>>
>> For DS-Go, we can follow the same policy as C++.
>>
>> Feedback is welcome.
>>
>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin wrote:
>>
>>> Gonna agree with Alexander here.
>>> I think we should provide a serde option for c++, but that we should
>>> not reject non-UTF-8 strings.
>>>
>>> That wouldn’t just be an API-breaking change. It would break
>>> compatibility of c++ with itself for anyone who doesn’t need language
>>> portability.
>>>
>>> A separate utf8_serde option gets my vote.
>>>
>>> jon
>>>
>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev wrote:
>>>
>>>> Regarding C++, I would think that the easiest approach is to instruct
>>>> the user to use a UTF8-validating string substitute instead of
>>>> std::string. I am not sure whether we should provide such a thing or
>>>> let the user come up with their own implementation.
>>>> Consider having a utf8_string that would validate the input in the
>>>> constructor but is otherwise identical to std::string.
>>>> So the user can instantiate, for example,
>>>> frequent_items_sketch<utf8_string> instead of
>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>
>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim wrote:
>>>>
>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>> retain and serialize strings, we should validate that string payloads
>>>>> are valid UTF-8 sequences to preserve cross-language portability.
>>>>>
>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>> time) is attractive because it is fail-fast, but it also adds
>>>>> additional cost on the hot path. If the community is comfortable with
>>>>> that overhead for string-based container sketches, I’m happy to
>>>>> pursue the update()-time validation approach.
>>>>>
>>>>> If performance sensitivity is a concern, an alternative would be to
>>>>> always validate at (de)serialization boundaries (to guarantee
>>>>> artifact correctness), and optionally provide a “fail-fast” mode that
>>>>> enables validation at update() as well.
>>>>>
>>>>> For DS-Go, we can follow the same policy.
>>>>> Go’s situation is a bit simpler in implementation because it provides
>>>>> UTF-8 validation in the standard library (unicode/utf8), so we
>>>>> wouldn’t need an external dependency for the validator.
>>>>>
>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes wrote:
>>>>>
>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a
>>>>>> user to update the sketch with a string and the sketch also retains
>>>>>> within the sketch a sample of the input strings seen. When
>>>>>> serialized, there is an implicit assumption that another user,
>>>>>> possibly in a different language, can successfully deserialize those
>>>>>> sketch images. These sketches include KLL, REQ, Classic Quantiles,
>>>>>> Sampling, FrequentItems, and Tuple. We informally call these
>>>>>> "container" sketches, because they contain actual samples from the
>>>>>> input stream. HLL, Theta, CPC, BloomFilter, etc., are not container
>>>>>> sketches.
>>>>>>
>>>>>> In the DS-Java library, all container sketches that allow strings
>>>>>> always use UTF_8. So the sketch images produced will contain proper
>>>>>> UTF_8 sequences.
>>>>>>
>>>>>> In the DS-CPP library, all the various data types are abstracted via
>>>>>> templates. The serialization operation is declared similar to
>>>>>>
>>>>>>     *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>>>>>>
>>>>>> where *T* is the item type, *os* is the output stream, and *sd* is
>>>>>> the SerDe that performs the conversion to bytes.
>>>>>>
>>>>>> If the user wants to use an item of type string, *T* would typically
>>>>>> be of type *std::string*, which is just a blob of bytes and no
>>>>>> requirement that it is UTF_8.
>>>>>>
>>>>>> So far, we have trusted users of the library to know that if they
>>>>>> update one of these container classes with a type *T*, that the
>>>>>> downstream user can successfully decode it.
>>>>>> But this could be catastrophic: A downstream user of a sketch image
>>>>>> could be separated from the creation of the sketch image by years and
>>>>>> be using a different language.
>>>>>>
>>>>>> One of the big advantages of our DataSketches project is that our
>>>>>> serialization images should be language and platform independent,
>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>
>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>
>>>>>> So how do we implement that? My thoughts are as follows:
>>>>>>
>>>>>> 1. We should document now, on the website and in appropriate places
>>>>>>    in the library, the potential danger of not using UTF_8 strings
>>>>>>    (at least until we have a more robust solution).
>>>>>> 2. I think implementing validation checks on UTF_8 strings at the
>>>>>>    SerDe boundaries may be too late. A user could have processed a
>>>>>>    large stream of data only to discover a failure at serialization
>>>>>>    time, which could be much later in time. The other possibility
>>>>>>    would be to validate the strings at the input into the sketch,
>>>>>>    typically in the *update()* method.
>>>>>> 3. For C++, there are 3rd party libraries that specialize in UTF_8
>>>>>>    validation, including ICU
>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>    <https://github.com/nemtrif/utfcpp>, and simdjson
>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>>    small section of code.
>>>>>> 4. I am not sure what the solutions are for Rust or Go.
>>>>>>
>>>>>> I welcome your feedback.
>>>>>>
>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison wrote:
>>>>>>
>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>> deserializes String values.
>>>>>>>
>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>
>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding. And
>>>>>>> we check the encoding on deserialization.
>>>>>>>
>>>>>>> However, the Rust ecosystem also supports "strings" that do not use
>>>>>>> UTF-8, such as BStr.
>>>>>>>
>>>>>>> So, my opinions are:
>>>>>>>
>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able to
>>>>>>>    choose a proper type to deserialize the bytes into a type that
>>>>>>>    doesn't require UTF-8 encoding.
>>>>>>>
>>>>>>> Best,
>>>>>>> tison.
>>>>>>>
>>>>>>> On Sat, Feb 14, 2026 at 17:24, Hyeonho Kim wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>>
>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++
>>>>>>>> (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>> a broader design question came up that may affect multiple
>>>>>>>> sketches.
>>>>>>>>
>>>>>>>> Based on my current understanding:
>>>>>>>>
>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>   valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>   Java-generated artifacts already assume valid UTF-8 string
>>>>>>>>   encoding.
>>>>>>>> - Rust and Python string types represent Unicode text and can be
>>>>>>>>   encoded to UTF-8. Please correct me if I am mistaken. (I don't
>>>>>>>>   know Rust and Python well.)
>>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>>>>   unless explicitly validated. So during serialization, it may
>>>>>>>>   produce invalid UTF-8 sequences.
>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>   enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>   invalid UTF-8 sequences.
>>>>>>>>
>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>> corrections.
>>>>>>>>
>>>>>>>> If we want to maintain cross-language portability for serialized
>>>>>>>> artifacts, one possible approach would be to ensure that any
>>>>>>>> serialized string data is valid UTF-8. This could potentially
>>>>>>>> apply to any sketches that serialize or deserialize string data.
>>>>>>>>
>>>>>>>> There seem to be several possible approaches:
>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>   caller discipline
>>>>>>>>
>>>>>>>> At this point I am not proposing a specific solution. I would like
>>>>>>>> to hear opinions from the community on whether we want to require
>>>>>>>> serialized string data to be valid UTF-8 for cross-language
>>>>>>>> portability.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>>
>>>>>>>> Hyeonho
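
For reference, the utf8_string idea raised in this thread (a std::string substitute that validates in its constructor) could look roughly like the sketch below. It is only an illustration under stated assumptions: is_valid_utf8 and this utf8_string class are hypothetical names, not existing datasketches-cpp APIs, and the validator is a plain byte-by-byte check following the well-formed byte ranges of RFC 3629 rather than any of the third-party libraries mentioned. A real substitute would likely also need hashing, comparison operators, and a matching SerDe before it could be used as a sketch item type.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper (not part of datasketches-cpp): returns true if s is
// well-formed UTF-8, rejecting overlong encodings, UTF-16 surrogates, and
// code points above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 <= 0x7F) { ++i; continue; }        // single-byte ASCII
    std::size_t len = 0;                      // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;       // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) {           // 0xC0 and 0xC1 would be overlong
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;              // exclude overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;              // exclude surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;              // exclude overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;              // exclude code points > U+10FFFF
    } else {
      return false;                           // stray continuation or invalid lead byte
    }
    if (n - i < len) return false;            // sequence truncated at end of input
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    }
    i += len;
  }
  return true;
}

// Hypothetical std::string substitute that validates once, at construction,
// so that any instance handed to a sketch is known to hold valid UTF-8.
class utf8_string {
public:
  utf8_string() = default;
  utf8_string(std::string s) : value_(std::move(s)) {
    if (!is_valid_utf8(value_)) throw std::invalid_argument("invalid UTF-8");
  }
  utf8_string(const char* s) : utf8_string(std::string(s)) {}

  const std::string& str() const { return value_; }
  bool operator==(const utf8_string& other) const { return value_ == other.value_; }

private:
  std::string value_;
};
```

With something like this, the fail-fast behavior lives in the item type rather than in the sketch: constructing utf8_string from bytes such as "\xC3\x28" throws std::invalid_argument before the value ever reaches update(), while users who keep std::string pay no validation cost.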
