Thanks. After thinking more about it and reviewing the C++ and Go code more closely, my view has changed.
I now think that changing the serialization format just to preserve UTF-8 validation behavior for C++ and Go would be too heavy a change. If we do not change the serialization format, then we cannot fully preserve behavioral consistency across serialization/deserialization anyway.

At the same time, I do not think we should ignore language-independent sketch images for string-containing sketches.

So my current view is that we should keep the sketch format unchanged and leave `update()` behavior unchanged. If possible, we should provide an explicit portability path through UTF-8 validating SerDe choices. If that is not desirable, then at minimum we should document clearly that cross-language portability for string-containing sketches depends on using valid UTF-8.

On Sat, Mar 7, 2026 at 4:47 PM Alexander Saydakov via dev <[email protected]> wrote:

> I would reiterate that in my view sketches should not care about validation.
> If the user desires validation, he can instantiate, say, frequent_items_sketch<utf8_string> instead of frequent_items_sketch<std::string>.
> utf8_string should perform validation.
>
> On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:
>
>> Hi all,
>>
>> I realized there is one more design point that may need discussion.
>>
>> For sketches that validate UTF-8 at update() time by default, with an explicit opt-out, that setting affects the behavior of future update() calls even after deserialization.
>>
>> So there seems to be a broader design choice here for string-specific sketches / update APIs:
>>
>> 1. Treat the UTF-8 validation setting as part of the serialized sketch state, so it is preserved across serialization/deserialization.
>> 2. Treat it as a runtime policy only, in which case it would need to be specified again after deserialization (or when constructing a new sketch).
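
To make the second option above concrete, here is a minimal sketch of a runtime-only policy. Everything in it is an assumption for illustration: `toy_string_sketch` is a hypothetical toy container, not a DataSketches class, its "image" is just the list of retained strings, and `has_no_forbidden_bytes` is a stand-in check rather than a full UTF-8 validator.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Stand-in check, NOT a full UTF-8 validator: the bytes 0xC0, 0xC1 and
// 0xF5..0xFF never occur in well-formed UTF-8, so they are cheap to reject.
inline bool has_no_forbidden_bytes(const std::string& s) {
  for (unsigned char b : s) {
    if (b == 0xC0 || b == 0xC1 || b >= 0xF5) return false;
  }
  return true;
}

// Hypothetical toy container used only to illustrate the policy question.
class toy_string_sketch {
public:
  explicit toy_string_sketch(bool validate = true) : validate_(validate) {}

  // Fail-fast check at ingest time, controlled by the runtime policy.
  void update(const std::string& item) {
    if (validate_ && !has_no_forbidden_bytes(item)) {
      throw std::invalid_argument("item is not valid UTF-8");
    }
    items_.push_back(item);
  }

  // The image holds only the retained items; the policy is not written.
  std::vector<std::string> serialize() const { return items_; }

  // Because the policy is not in the image, the caller must state it again.
  static toy_string_sketch deserialize(const std::vector<std::string>& image,
                                       bool validate) {
    toy_string_sketch sk(validate);
    sk.items_ = image;
    return sk;
  }

  std::size_t size() const { return items_.size(); }

private:
  bool validate_;
  std::vector<std::string> items_;
};

// A strict sketch round-trips into a relaxed one if the caller says so.
inline void policy_demo() {
  toy_string_sketch strict(true);
  strict.update("ok");
  toy_string_sketch relaxed =
      toy_string_sketch::deserialize(strict.serialize(), false);
  relaxed.update("\xFF");  // accepted: the original policy was not restored
  assert(relaxed.size() == 2);
}
```

The point of the sketch is only that future `update()` calls can behave differently after a round trip unless the caller restores the same policy.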
>>
>> The first option would preserve behavioral consistency, so it seems like the more semantically consistent choice. However, it also seems like a much bigger decision in practice, since it would require a serialization format change / versioning.
>>
>> The second option avoids changing the serialized format, but a deserialized sketch may not behave exactly the same for future update() calls unless the caller explicitly restores the same policy.
>>
>> What do others think?
>>
>> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>>
>>> I agree. Here is a proposed wording that is a sort of "policy" way to think about this:
>>>
>>> For "container" type sketches that can potentially retain Strings:
>>>
>>> - If a sketch has the word "string" as part of its name, then UTF-8 validation at update() should be the default with an explicit opt-out. Example: ArrayOfStringsTupleSketch.
>>> - If an update method to a sketch has an explicit "string" parameter, then UTF-8 validation should be the default with an explicit opt-out. Example: FdtSketch::update(String[]).
>>> - Otherwise, if a sketch or update method accepts just a generic type T, then we will provide a UTF-8 validating "SerDe" object that can be optionally used for type T.
>>>
>>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Hi all!
>>>>
>>>> Unless there are objections, I propose the following:
>>>>
>>>> 1. Introduce an opt-in UTF-8 validating SerDe for std::string (validation OFF by default).
>>>> 2. For AoS string items, enable UTF-8 validation at update() by default, with an explicit opt-out.
>>>>
>>>> If this direction looks reasonable, I will proceed accordingly in the AoS PR and follow up with a separate PR for the SerDe option.
>>>>
>>>> Thanks,
>>>>
>>>> Hyeonho
>>>>
>>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]> wrote:
>>>>
>>>>> Thanks all for the feedback.
>>>>>
>>>>> We can preserve backward compatibility for existing C++ users while also providing a clear path for cross-language portability.
>>>>>
>>>>> What do you think about the following approach?
>>>>>
>>>>> - SerDe with string: Add an option to validate whether the string contains valid UTF-8 sequences. The default would be validation OFF to preserve existing compatibility.
>>>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast). Validation would be enabled by default, with an explicit opt-out for users who want it.
>>>>>
>>>>> For DS-Go, we can follow the same policy as C++.
>>>>>
>>>>> Feedback is welcome.
>>>>>
>>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]> wrote:
>>>>>
>>>>>> Gonna agree with Alexander here. I think we should provide a serde option for c++, but that we should not reject non-UTF-8 strings.
>>>>>>
>>>>>> That wouldn’t just be an API-breaking change. It would break compatibility of c++ with itself for anyone who doesn’t need language portability.
>>>>>>
>>>>>> A separate utf8_serde option gets my vote.
>>>>>>
>>>>>> jon
>>>>>>
>>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <[email protected]> wrote:
>>>>>>
>>>>>>> Regarding C++, I would think that the easiest approach is to instruct the user to use a UTF8-validating string substitute instead of std::string.
>>>>>>> I am not sure whether we should provide such a thing or let the user come up with their own implementation.
>>>>>>> Consider having a utf8_string that would validate the input in the constructor but is otherwise identical to std::string.
>>>>>>> So the user can instantiate, for example, frequent_items_sketch<utf8_string> instead of frequent_items_sketch<std::string> if validation is necessary.
>>>>>>>
>>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>
>>>>>>>> Thanks for the feedback. I agree that for container sketches that retain and serialize strings, we should validate that string payloads are valid UTF-8 sequences to preserve cross-language portability.
>>>>>>>>
>>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest time) is attractive because it is fail-fast, but it also adds cost on the hot path. If the community is comfortable with that overhead for string-based container sketches, I’m happy to pursue the update()-time validation approach.
>>>>>>>>
>>>>>>>> If performance sensitivity is a concern, an alternative would be to always validate at (de)serialization boundaries (to guarantee artifact correctness), and optionally provide a “fail-fast” mode that enables validation at update() as well.
>>>>>>>>
>>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit simpler in implementation because it provides UTF-8 validation in the standard library (unicode/utf8), so we wouldn’t need an external dependency for the validator.
>>>>>>>>
>>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a user to update the sketch with a string and the sketch also retains within the sketch a sample of the input strings seen.
>>>>>>>>> When serialized, there is an implicit assumption that another user, possibly in a different language, can successfully deserialize those sketch images. These sketches include KLL, REQ, Classic Quantiles, Sampling, FrequentItems, and Tuple. We informally call these "container" sketches, because they contain actual samples from the input stream. HLL, Theta, CPC, BloomFilter, etc., are not container sketches.
>>>>>>>>>
>>>>>>>>> In the DS-Java library, all container sketches that allow strings always use UTF_8. So the sketch images produced will contain proper UTF_8 sequences.
>>>>>>>>>
>>>>>>>>> In the DS-CPP library, all the various data types are abstracted via templates. The serialization operation is declared similar to
>>>>>>>>>
>>>>>>>>> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>>>>>>>>>
>>>>>>>>> where *T* is the item type, *os* is the output stream and *sd* is the SerDe that performs the conversion to bytes.
>>>>>>>>>
>>>>>>>>> If the user wants to use an item of type string, *T* would typically be of type *std::string*, which is just a blob of bytes with no requirement that it is UTF_8.
>>>>>>>>>
>>>>>>>>> So far, we have trusted users of the library to know that if they update one of these container classes with a type *T*, the downstream user can successfully decode it. But this could be catastrophic: A downstream user of a sketch image could be separated from the creation of the sketch image by years and be using a different language.
>>>>>>>>>
>>>>>>>>> One of the big advantages of our DataSketches project is that our serialization images should be language and platform independent, allowing cross-language and cross-platform interchange of sketches.
>>>>>>>>>
>>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch images that contain strings, those strings must be UTF_8.
>>>>>>>>>
>>>>>>>>> So how do we implement that? My thoughts are as follows:
>>>>>>>>>
>>>>>>>>> 1. We should document now in the website and in appropriate places in the library the potential danger of not using UTF_8 strings. (At least until we have a more robust solution)
>>>>>>>>> 2. I think implementing validation checks on UTF_8 strings at the SerDe boundaries may be too late. A user could have processed a large stream of data only to discover a failure at serialization time, which could be much later in time. The other possibility would be to validate the strings at the input into the sketch, typically in the *update()* method.
>>>>>>>>> 3. For C++, there are 3rd party libraries that specialize in UTF_8 validation, including ICU <https://github.com/unicode-org/icu>, UTF8-CPP <https://github.com/nemtrif/utfcpp> and simdjson <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>. (These have standard licensing). From what I've read, UTF-8 validation, if done correctly, can be done very fast, with only a small section of code.
>>>>>>>>> 4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>>
>>>>>>>>> I welcome your feedback.
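
As a rough illustration of how small a dependency-free validator can be, here is a minimal scalar sketch in C++. The function name `is_valid_utf8` is an assumption for this example and is not an existing DataSketches function. It follows the well-formed byte ranges of RFC 3629 and makes no attempt at the SIMD speedups the libraries above provide.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-8 per RFC 3629: no overlong
// encodings, no surrogates (U+D800..U+DFFF), nothing above U+10FFFF.
inline bool is_valid_utf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }  // single-byte ASCII

    std::size_t len = 0;                 // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;  // allowed range of the second byte
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; }
    else if (b0 == 0xE0)               { len = 3; lo = 0xA0; }  // no overlong
    else if (b0 == 0xED)               { len = 3; hi = 0x9F; }  // no surrogates
    else if (b0 >= 0xE1 && b0 <= 0xEF) { len = 3; }
    else if (b0 == 0xF0)               { len = 4; lo = 0x90; }  // no overlong
    else if (b0 == 0xF4)               { len = 4; hi = 0x8F; }  // <= U+10FFFF
    else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
    else { return false; }  // 0x80..0xC1 and 0xF5..0xFF cannot lead a sequence

    if (n - i < len) return false;  // sequence is truncated
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    }
    i += len;
  }
  return true;
}

// A few spot checks on known-good and known-bad byte sequences.
inline void utf8_spot_checks() {
  assert(is_valid_utf8("caf\xC3\xA9"));    // U+00E9 encoded in two bytes
  assert(!is_valid_utf8("\xC0\xAF"));      // overlong encoding of '/'
  assert(!is_valid_utf8("\xED\xA0\x80"));  // UTF-16 surrogate U+D800
}
```

A check like this costs one pass over the bytes, so the remaining question is where to call it (at `update()` or at the SerDe boundary) rather than whether it is feasible.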
>>>>>>>>>
>>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl deserializes String values.
>>>>>>>>>>
>>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>>
>>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding. And we check the encoding on deserialization.
>>>>>>>>>>
>>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not use UTF-8, such as BStr.
>>>>>>>>>>
>>>>>>>>>> So, my opinions are:
>>>>>>>>>>
>>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able to deserialize the bytes into a type that doesn't require UTF-8 encoding.
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> tison.
>>>>>>>>>>
>>>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi all,
>>>>>>>>>>>
>>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++ (ref: https://github.com/apache/datasketches-cpp/pull/476), a broader design question came up that may affect multiple sketches.
>>>>>>>>>>>
>>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>>
>>>>>>>>>>> - In datasketches-java, string serialization already produces valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8).
>>>>>>>>>>>   So Java-generated artifacts already assume valid UTF-8 string encoding.
>>>>>>>>>>> - Rust and Python string types represent Unicode text and can be encoded to UTF-8. Please correct me if I am mistaken. (I don't know Rust and Python well)
>>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8 unless explicitly validated. So during serialization, it may produce invalid UTF-8 sequences.
>>>>>>>>>>> - In C++, std::string is also a byte container and does not enforce UTF-8 validity. So during serialization, it may produce invalid UTF-8 sequences.
>>>>>>>>>>>
>>>>>>>>>>> If I am mistaken on any of these points, I would appreciate corrections.
>>>>>>>>>>>
>>>>>>>>>>> If we want to maintain cross-language portability for serialized artifacts, one possible approach would be to ensure that any serialized string data is valid UTF-8. This could potentially apply to any sketches that serialize or deserialize string data.
>>>>>>>>>>>
>>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on caller discipline
>>>>>>>>>>>
>>>>>>>>>>> At this point I am not proposing a specific solution. I would like to hear opinions from the community on whether we want to require serialized string data to be valid UTF-8 for cross-language portability.
>>>>>>>>>>>
>>>>>>>>>>> Thanks,
>>>>>>>>>>>
>>>>>>>>>>> Hyeonho
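
For reference, the "validate at serialization boundaries" approach listed above could look roughly like the following. This is a sketch under stated assumptions: `validating_string_serde` is a hypothetical class, its `serialize`/`deserialize` signatures are simplified and do not match the datasketches-cpp SerDe interface, and the 4-byte length prefix is illustrative rather than the library's wire format. The validity predicate is injected, so any UTF-8 validator can be plugged in; the demo uses a stand-in predicate only.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <istream>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical opt-in SerDe for std::string that checks every item when it
// crosses the serialization boundary, in either direction.
class validating_string_serde {
public:
  using predicate = std::function<bool(const std::string&)>;

  explicit validating_string_serde(predicate is_valid)
      : is_valid_(std::move(is_valid)) {}

  // Writes a 4-byte little-endian length followed by the raw bytes.
  void serialize(std::ostream& os, const std::string& item) const {
    check(item, "serialize");
    if (item.size() > 0xFFFFFFFFu) throw std::length_error("item too long");
    const std::uint32_t len = static_cast<std::uint32_t>(item.size());
    for (int shift = 0; shift < 32; shift += 8) {
      os.put(static_cast<char>((len >> shift) & 0xFF));
    }
    os.write(item.data(), static_cast<std::streamsize>(item.size()));
  }

  // Reads one item back and rejects it if the predicate fails.
  // A real implementation would also bound `len` before allocating.
  std::string deserialize(std::istream& is) const {
    std::uint32_t len = 0;
    for (int shift = 0; shift < 32; shift += 8) {
      const int byte = is.get();
      if (byte == std::char_traits<char>::eof()) {
        throw std::runtime_error("truncated length prefix");
      }
      len |= static_cast<std::uint32_t>(byte) << shift;
    }
    std::string item(len, '\0');
    is.read(&item[0], static_cast<std::streamsize>(len));
    if (static_cast<std::uint32_t>(is.gcount()) != len) {
      throw std::runtime_error("truncated item");
    }
    check(item, "deserialize");
    return item;
  }

private:
  void check(const std::string& item, const char* where) const {
    if (!is_valid_(item)) {
      throw std::invalid_argument(std::string("invalid UTF-8 in ") + where);
    }
  }

  predicate is_valid_;
};

// Round trip with a stand-in predicate (0xFF never occurs in valid UTF-8).
inline void serde_round_trip_demo() {
  validating_string_serde serde([](const std::string& s) {
    return s.find('\xFF') == std::string::npos;
  });
  std::stringstream ss;
  serde.serialize(ss, "caf\xC3\xA9");
  assert(serde.deserialize(ss) == "caf\xC3\xA9");
}
```

Because the check lives in the SerDe, existing users of a plain string SerDe would see no behavior change, while users who need portable images can opt in.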
