Thanks, everyone, for the discussion. My understanding so far is that the policy direction should be:

- For string-containing sketches, interoperability pitfalls related to UTF-8 encoding should be documented clearly.
- Optional helper tools for common cases may be useful, but they do not seem essential to the policy itself.

Given that, I think the most practical next step is documentation. As far as I know, this is not documented clearly today for C++ and Go, so I will follow up by proposing documentation updates there.

On Sun, Mar 8, 2026 at 2:45 PM Lee Rhodes <[email protected]> wrote:

> This has been a helpful discussion. My thinking about this has also
> changed, but for a different reason.
>
> My proposal to have an encoding standard for strings came from a (noble?)
> desire to help protect our users from footguns.
>
> However, ensuring compatibility between any two sketches that have been
> independently loaded is a much deeper can of worms than we have discussed
> here:
>
>    - Imagine a merge of two sketches inadvertently fed strings using
>    different character encodings. It doesn't matter if the sketches originated
>    from different programming languages or not.
>    - Converting a string to a hash doesn't change this. This means
>    virtually all of our sketches could be vulnerable to this user mistake and
>    not just our container sketches.
>    - Natural numeric instability of doubles could also create similar
>    silent failures if the user is not careful.
>
> I don't think that there is any way we can programmatically protect our
> users from all of these possible mistakes.
>
> Having said that, providing some useful tools that could help the user
> validate UTF-8 strings might be useful. It won't protect against all of the
> potential user mistakes of this type, just perhaps some common ones.
>
> But if we decide not to do anything programmatic, we could at least
> provide sufficient warnings in the documentation of these possible, and
> easy-to-make, pitfalls. We don't have to do this right away, but as the
> various libraries move to new versions, this kind of documentation should
> be on the list to add.
>
> On Sat, Mar 7, 2026 at 2:57 AM Hyeonho Kim <[email protected]> wrote:
>
>> Thanks.
>>
>> After thinking more about it and reviewing the C++ and Go code more
>> closely, my view has changed.
>>
>> I now think that changing the serialization format just to preserve UTF-8
>> validation behavior for C++ and Go would be too heavy. If we do not change
>> the serialization format, then we cannot fully preserve behavioral
>> consistency across serialization/deserialization anyway.
>>
>> At the same time, I do not think we should ignore language-independent
>> sketch images for string-containing sketches.
>> So my current view is that we should keep the sketch format unchanged and
>> leave `update()` behavior unchanged.
>>
>> If possible, we could provide an explicit portability path through UTF-8
>> validating SerDe choices.
>> If that is not desirable, then at minimum I think we should document this
>> point clearly. In particular, I think we should document clearly that
>> cross-language portability for string-containing sketches depends on using
>> valid UTF-8.
>>
>> On Sat, Mar 7, 2026 at 4:47 PM Alexander Saydakov via dev <[email protected]> wrote:
>>
>>> I would reiterate that in my view sketches should not care about
>>> validation.
>>> If the user desires validation, he can instantiate, say,
>>> frequent_items_sketch<utf8_string> instead of
>>> frequent_items_sketch<std::string>.
>>> utf8_string should perform validation.
>>>
>>> On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Hi all,
>>>>
>>>> I realized there is one more design point that may need discussion.
>>>>
>>>> For sketches that validate UTF-8 at update() time by default, with an
>>>> explicit opt-out, that setting affects the behavior of future update()
>>>> calls even after deserialization.
>>>>
>>>> So there seems to be a broader design choice here for string-specific
>>>> sketches / update APIs:
>>>>
>>>>    1. Treat the UTF-8 validation setting as part of the serialized sketch
>>>>    state, so it is preserved across serialization/deserialization.
>>>>    2. Treat it as a runtime policy only, in which case it would need to
>>>>    be specified again after deserialization (or when constructing a new
>>>>    sketch).
>>>>
>>>> The first option would preserve behavioral consistency, so it seems
>>>> like the more semantically consistent choice. However, it also seems like a
>>>> much bigger decision in practice, since it would require a serialization
>>>> format change / versioning.
>>>>
>>>> The second option avoids changing the serialized format, but a
>>>> deserialized sketch may not behave exactly the same for future update()
>>>> calls unless the caller explicitly restores the same policy.
>>>>
>>>> What do others think?
>>>>
>>>> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>>>>
>>>>> I agree. Here is a proposed wording that is a sort of a "policy" way
>>>>> to think about this:
>>>>>
>>>>> For "container" type sketches that can potentially retain Strings:
>>>>>
>>>>>    - If a sketch has the word "string" as part of its name, then
>>>>>    UTF-8 validation at update() should be the default with an explicit
>>>>>    opt-out. Example: ArrayOfStringsTupleSketch.
>>>>>    - If an update method to a sketch has an explicit "string"
>>>>>    parameter, then UTF-8 validation should be the default with an explicit
>>>>>    opt-out. Example: FdtSketch::update(String[]).
>>>>>    - Otherwise, if a sketch or update method accepts just a generic
>>>>>    type T, then we will provide a UTF-8 validating "SerDe" object that
>>>>>    can be optionally used for type T.
>>>>>
>>>>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>>>>
>>>>>> Hi all!
>>>>>>
>>>>>> Unless there are objections, I propose the following:
>>>>>>
>>>>>>    1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>>>>>    (validation OFF by default).
>>>>>>    2. For AoS string items, enable UTF-8 validation at update() by
>>>>>>    default, with an explicit opt-out.
>>>>>>
>>>>>> If this direction looks reasonable, I will proceed accordingly in the
>>>>>> AoS PR and follow up with a separate PR for the SerDe option.
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Hyeonho
>>>>>>
>>>>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks all for the feedback.
>>>>>>>
>>>>>>> We can preserve backward compatibility for existing C++ users while
>>>>>>> also providing a clear path for cross-language portability.
>>>>>>>
>>>>>>> What do you think about the following approach?
>>>>>>>
>>>>>>> - SerDe with string: Add an option to validate whether the string
>>>>>>> contains valid UTF-8 sequences. The default would be validation OFF to
>>>>>>> preserve existing compatibility.
>>>>>>>
>>>>>>> - AoS tuple sketch: Validate UTF-8 at the update method (fail-fast).
>>>>>>> Validation would be enabled by default, with an explicit opt-out for
>>>>>>> users who want it.
>>>>>>>
>>>>>>> For DS-Go, we can follow the same policy as C++.
>>>>>>>
>>>>>>> Feedback is welcome.
>>>>>>>
>>>>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]> wrote:
>>>>>>>
>>>>>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>>>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>>>>>
>>>>>>>> That wouldn’t just be an API-breaking change. It would break
>>>>>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>>>>>> portability.
>>>>>>>>
>>>>>>>> A separate utf8_serde option gets my vote.
>>>>>>>>
>>>>>>>> jon
>>>>>>>>
>>>>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Regarding C++, I would think that the easiest approach is to
>>>>>>>>> instruct the user to use a UTF-8-validating string substitute instead
>>>>>>>>> of std::string.
>>>>>>>>> I am not sure whether we should provide such a thing or let the
>>>>>>>>> user come up with their own implementation.
>>>>>>>>> Consider having a utf8_string that would validate the input in the
>>>>>>>>> constructor but is otherwise identical to std::string.
>>>>>>>>> So the user can instantiate, for example,
>>>>>>>>> frequent_items_sketch<utf8_string> instead of
>>>>>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>>>>>
>>>>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>>>>>> retain and serialize strings, we should validate that string
>>>>>>>>>> payloads are valid UTF-8 sequences to preserve cross-language
>>>>>>>>>> portability.
>>>>>>>>>>
>>>>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>>>>>> time) is attractive because it is fail-fast, but it also adds
>>>>>>>>>> cost on the hot path. If the community is comfortable with that
>>>>>>>>>> overhead for string-based container sketches, I’m happy to pursue the
>>>>>>>>>> update()-time validation approach.
>>>>>>>>>>
>>>>>>>>>> If performance sensitivity is a concern, an alternative would be
>>>>>>>>>> to always validate at (de)serialization boundaries (to guarantee
>>>>>>>>>> artifact correctness), and optionally provide a “fail-fast” mode that
>>>>>>>>>> enables validation at update() as well.
>>>>>>>>>>
>>>>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>>>>>> dependency for the validator.
>>>>>>>>>>
>>>>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>>>>>>>>>
>>>>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that
>>>>>>>>>>> allow a user to update the sketch with a string and that also retain
>>>>>>>>>>> within the sketch a sample of the input strings seen. When serialized,
>>>>>>>>>>> there is an implicit assumption that another user, possibly in a
>>>>>>>>>>> different language, can successfully deserialize those sketch images.
>>>>>>>>>>> These sketches include KLL, REQ, Classic Quantiles, Sampling,
>>>>>>>>>>> FrequentItems, and Tuple. We informally call these "container"
>>>>>>>>>>> sketches, because they contain actual samples from the input stream.
>>>>>>>>>>> HLL, Theta, CPC, BloomFilter, etc., are not container sketches.
>>>>>>>>>>>
>>>>>>>>>>> In the DS-Java library, all container sketches that allow
>>>>>>>>>>> strings always use UTF_8. So the sketch images produced will
>>>>>>>>>>> contain proper UTF_8 sequences.
>>>>>>>>>>>
>>>>>>>>>>> In the DS-CPP library, all the various data types are abstracted
>>>>>>>>>>> via templates. The serialization operation is declared similar to
>>>>>>>>>>>
>>>>>>>>>>> *sketch<T>::serialize(std::ostream& os, const SerDe& sd)*
>>>>>>>>>>>
>>>>>>>>>>> where *T* is the item type, *os* is the output stream and *sd* is the
>>>>>>>>>>> SerDe that performs the conversion to bytes.
>>>>>>>>>>>
>>>>>>>>>>> If the user wants to use an item of type string, *T* would
>>>>>>>>>>> typically be of type *std::string*, which is just a blob of
>>>>>>>>>>> bytes with no requirement that it is UTF_8.
>>>>>>>>>>>
>>>>>>>>>>> So far, we have trusted users of the library to know that if
>>>>>>>>>>> they update one of these container classes with a type *T*,
>>>>>>>>>>> the downstream user can successfully decode it. But this could be
>>>>>>>>>>> catastrophic: A downstream user of a sketch image could be
>>>>>>>>>>> separated from the creation of the sketch image by years and be using
>>>>>>>>>>> a different language.
>>>>>>>>>>>
>>>>>>>>>>> One of the big advantages of our DataSketches project is that
>>>>>>>>>>> our serialization images should be language and platform independent,
>>>>>>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>>>>>>
>>>>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>>>>>
>>>>>>>>>>> So how do we implement that? My thoughts are as follows:
>>>>>>>>>>>
>>>>>>>>>>>    1. We should document now in the website and in appropriate
>>>>>>>>>>>    places in the library the potential danger of not using UTF_8
>>>>>>>>>>>    strings. (At least until we have a more robust solution.)
>>>>>>>>>>>    2. I think implementing validation checks on UTF_8 strings
>>>>>>>>>>>    at the SerDe boundaries may be too late. A user could have
>>>>>>>>>>>    processed a large stream of data only to discover a failure at
>>>>>>>>>>>    serialization time, which could be much later in time. The other
>>>>>>>>>>>    possibility would be to validate the strings at the input into the
>>>>>>>>>>>    sketch, typically in the *update()* method.
>>>>>>>>>>>    3. For C++, there are 3rd party libraries that specialize in
>>>>>>>>>>>    UTF_8 validation, including ICU
>>>>>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>>>>>    <https://github.com/nemtrif/utfcpp> and simdjson
>>>>>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>>>>>    validation, if done correctly, can be done very fast, with only a
>>>>>>>>>>>    small section of code.
>>>>>>>>>>>    4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>>>>
>>>>>>>>>>> I welcome your feedback.
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>>>>>> deserializes String values.
>>>>>>>>>>>>
>>>>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>>>>
>>>>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding.
>>>>>>>>>>>> And we check the encoding on deserialization.
>>>>>>>>>>>>
>>>>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not
>>>>>>>>>>>> use UTF-8, such as BStr.
>>>>>>>>>>>>
>>>>>>>>>>>> So, my opinions are:
>>>>>>>>>>>>
>>>>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be
>>>>>>>>>>>> able to choose a proper type to deserialize the bytes into a type
>>>>>>>>>>>> that doesn't require UTF-8 encoding.
>>>>>>>>>>>>
>>>>>>>>>>>> Best,
>>>>>>>>>>>> tison.
>>>>>>>>>>>>
>>>>>>>>>>>> On Sat, Feb 14, 2026 at 17:24, Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>
>>>>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in
>>>>>>>>>>>>> C++ (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>>>>>> a broader design question came up that may affect multiple
>>>>>>>>>>>>> sketches.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>>>>
>>>>>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>>>>>> valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>>>>>> Java-generated artifacts already assume valid UTF-8 string encoding.
>>>>>>>>>>>>> - Rust and Python string types represent Unicode text and can
>>>>>>>>>>>>> be encoded to UTF-8. Please correct me if I am mistaken. (I don't
>>>>>>>>>>>>> know Rust and Python well.)
>>>>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid
>>>>>>>>>>>>> UTF-8 unless explicitly validated. So during serialization, it
>>>>>>>>>>>>> may produce invalid UTF-8 sequences.
>>>>>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>>>>>> enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>>>>>> invalid UTF-8 sequences.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>>>>>> corrections.
>>>>>>>>>>>>>
>>>>>>>>>>>>> If we want to maintain cross-language portability for
>>>>>>>>>>>>> serialized artifacts, one possible approach would be to ensure that
>>>>>>>>>>>>> any serialized string data is valid UTF-8. This could potentially
>>>>>>>>>>>>> apply to any sketches that serialize or deserialize string data.
>>>>>>>>>>>>>
>>>>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>>>>>> caller discipline
>>>>>>>>>>>>>
>>>>>>>>>>>>> At this point I am not proposing a specific solution. I would
>>>>>>>>>>>>> like to hear opinions from the community on whether we want to
>>>>>>>>>>>>> require serialized string data to be valid UTF-8 for cross-language
>>>>>>>>>>>>> portability.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Thanks,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Hyeonho
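
For readers following the `utf8_string` idea raised in the thread, here is a minimal C++ sketch of what such a validating wrapper might look like. It is only an illustration under stated assumptions, not the DataSketches API: the function `is_valid_utf8` and the class layout are hypothetical, and the checks follow the well-formed byte ranges of the Unicode standard (no overlong forms, no surrogates, nothing above U+10FFFF). To actually use it as `frequent_items_sketch<utf8_string>`, the type would also need hashing, equality, and a matching SerDe, which are omitted here.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: returns true if s is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }              // single-byte ASCII
    std::size_t len = 0;                           // total bytes in this sequence
    unsigned char lo = 0x80, hi = 0xBF;            // allowed range of the 2nd byte
    if (b0 >= 0xC2 && b0 <= 0xDF) { len = 2; }
    else if (b0 == 0xE0) { len = 3; lo = 0xA0; }   // excludes overlong 3-byte forms
    else if (b0 >= 0xE1 && b0 <= 0xEC) { len = 3; }
    else if (b0 == 0xED) { len = 3; hi = 0x9F; }   // excludes surrogates U+D800..U+DFFF
    else if (b0 == 0xEE || b0 == 0xEF) { len = 3; }
    else if (b0 == 0xF0) { len = 4; lo = 0x90; }   // excludes overlong 4-byte forms
    else if (b0 >= 0xF1 && b0 <= 0xF3) { len = 4; }
    else if (b0 == 0xF4) { len = 4; hi = 0x8F; }   // excludes code points > U+10FFFF
    else { return false; }                         // stray continuation or invalid lead byte
    if (n - i < len) return false;                 // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false; // remaining bytes must be 10xxxxxx
    }
    i += len;
  }
  return true;
}

// Hypothetical wrapper: validates once, in the constructor (fail-fast).
class utf8_string {
public:
  explicit utf8_string(std::string s) : value_(std::move(s)) {
    if (!is_valid_utf8(value_)) {
      throw std::invalid_argument("utf8_string: input is not valid UTF-8");
    }
  }
  const std::string& str() const { return value_; }

private:
  std::string value_;
};
```

Validating in the constructor keeps the cost at ingest time and leaves the sketch itself unaware of encodings, which matches the "sketches should not care about validation" position in the thread; a SerDe-level check would instead move the same `is_valid_utf8` call to the serialization boundary.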
