I would reiterate that, in my view, sketches should not care about validation. A user who wants validation can instantiate, say, frequent_items_sketch<utf8_string> instead of frequent_items_sketch<std::string>. The utf8_string type should perform the validation.
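To make the idea concrete, here is a minimal sketch of what such a type might look like. It is not library code: the names is_valid_utf8, utf8_string, str() and utf8_string_demo are assumptions made up for this illustration, the validator is a plain scalar check following RFC 3629 (not one of the fast SIMD validators), and a real sketch instantiation would probably also need a matching SerDe in addition to the equality and hash support shown here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical helper: returns true if s is well-formed UTF-8 (RFC 3629).
inline bool is_valid_utf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }  // ASCII fast path
    std::size_t len;       // total bytes in this sequence
    std::uint32_t cp;      // decoded code point
    std::uint32_t min_cp;  // smallest code point allowed for this length
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;                 // stray continuation or invalid lead byte
    if (n - i < len) return false;     // truncated sequence
    for (std::size_t k = 1; k < len; ++k) {
      const unsigned char b = p[i + k];
      if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
      cp = (cp << 6) | (b & 0x3F);
    }
    if (cp < min_cp) return false;                   // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    i += len;
  }
  return true;
}

// Hypothetical std::string substitute that validates once, in the constructor.
class utf8_string {
public:
  utf8_string() = default;
  utf8_string(std::string s) : value_(std::move(s)) {
    if (!is_valid_utf8(value_)) {
      throw std::invalid_argument("utf8_string: input is not valid UTF-8");
    }
  }
  utf8_string(const char* s) : utf8_string(std::string(s)) {}
  const std::string& str() const noexcept { return value_; }
  friend bool operator==(const utf8_string& a, const utf8_string& b) {
    return a.value_ == b.value_;
  }
private:
  std::string value_;
};

// Hashing delegates to std::string so the type can be used as a map key.
namespace std {
template<> struct hash<utf8_string> {
  size_t operator()(const utf8_string& s) const noexcept {
    return hash<string>()(s.str());
  }
};
}

// Small usage check: valid input is accepted, invalid input is rejected.
inline void utf8_string_demo() {
  utf8_string ok("caf\xC3\xA9");  // "cafe" with an acute accent, 5 bytes
  assert(ok.str().size() == 5);
  assert(std::hash<utf8_string>()(ok) == std::hash<std::string>()(ok.str()));
  bool rejected = false;
  try { utf8_string bad("\xFF\xFE"); } catch (const std::invalid_argument&) { rejected = true; }
  assert(rejected);
}
```

With a type like this the cost of validation is paid once per item at construction, and the sketch itself stays unaware of encodings, which is the point of the proposal.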
On Fri, Mar 6, 2026 at 10:17 PM Hyeonho Kim <[email protected]> wrote:

> Hi all,
>
> I realized there is one more design point that may need discussion.
>
> For sketches that validate UTF-8 at update() time by default, with an
> explicit opt-out, that setting affects the behavior of future update()
> calls even after deserialization.
>
> So there seems to be a broader design choice here for string-specific
> sketches / update APIs:
>
> 1. Treat the UTF-8 validation setting as part of the serialized sketch
>    state, so it is preserved across serialization/deserialization.
> 2. Treat it as a runtime policy only, in which case it would need to be
>    specified again after deserialization (or when constructing a new
>    sketch).
>
> The first option would preserve behavioral consistency, so it seems like
> the more semantically consistent choice. However, it also seems like a much
> bigger decision in practice, since it would require a serialization format
> change / versioning.
>
> The second option avoids changing the serialized format, but a
> deserialized sketch may not behave exactly the same for future update()
> calls unless the caller explicitly restores the same policy.
>
> What do others think?
>
> On Wed, Mar 4, 2026 at 5:30 AM Lee Rhodes <[email protected]> wrote:
>
>> I agree. Here is a proposed wording that is a sort of "policy" way to
>> think about this:
>>
>> For "container" type sketches that can potentially retain Strings:
>>
>> - If a sketch has the word "string" as part of its name, then UTF-8
>>   validation at update() should be the default with an explicit
>>   opt-out. Example: ArrayOfStringsTupleSketch.
>> - If an update method to a sketch has an explicit "string" parameter,
>>   then UTF-8 validation should be the default with an explicit opt-out.
>>   Example: FdtSketch::update(String[]).
>> - Otherwise, if a sketch or update method accepts just a generic type
>>   T, then we will provide a UTF-8 validating "SerDe" object that can be
>>   optionally used for type T.
>>
>> On Tue, Mar 3, 2026 at 7:32 AM Hyeonho Kim <[email protected]> wrote:
>>
>>> Hi all!
>>>
>>> Unless there are objections, I propose the following:
>>>
>>> 1. Introduce an opt-in UTF-8 validating SerDe for std::string
>>>    (validation OFF by default).
>>> 2. For AoS string items, enable UTF-8 validation at update() by
>>>    default, with an explicit opt-out.
>>>
>>> If this direction looks reasonable, I will proceed accordingly in the
>>> AoS PR and follow up with a separate PR for the SerDe option.
>>>
>>> Thanks,
>>>
>>> Hyeonho
>>>
>>> On Fri, Feb 20, 2026 at 11:59 PM Hyeonho Kim <[email protected]> wrote:
>>>
>>>> Thanks all for the feedback.
>>>>
>>>> We can preserve backward compatibility for existing C++ users while
>>>> also providing a clear path for cross-language portability.
>>>>
>>>> What do you think about the following approach?
>>>>
>>>> - SerDe with string: Add an option to validate whether the string
>>>>   contains valid UTF-8 sequences. The default would be validation OFF
>>>>   to preserve existing compatibility.
>>>> - AoS tuple sketch: Validate UTF-8 in the update method (fail-fast).
>>>>   Validation would be enabled by default, with an explicit opt-out for
>>>>   users who want it.
>>>>
>>>> For DS-Go, we can follow the same policy as C++.
>>>>
>>>> Feedback is welcome.
>>>>
>>>> On Wed, Feb 18, 2026 at 3:24 AM Jon Malkin <[email protected]> wrote:
>>>>
>>>>> Gonna agree with Alexander here. I think we should provide a serde
>>>>> option for c++, but that we should not reject non-UTF-8 strings.
>>>>>
>>>>> That wouldn’t just be an API-breaking change. It would break
>>>>> compatibility of c++ with itself for anyone who doesn’t need language
>>>>> portability.
>>>>>
>>>>> A separate utf8_serde option gets my vote.
>>>>>
>>>>> jon
>>>>>
>>>>> On Tue, Feb 17, 2026 at 10:12 AM Alexander Saydakov via dev <[email protected]> wrote:
>>>>>
>>>>>> Regarding C++, I would think that the easiest approach is to instruct
>>>>>> the user to use a UTF-8-validating string substitute instead of
>>>>>> std::string.
>>>>>> I am not sure whether we should provide such a thing or let the user
>>>>>> come up with their own implementation.
>>>>>> Consider having a utf8_string that would validate the input in the
>>>>>> constructor but is otherwise identical to std::string.
>>>>>> So the user can instantiate, for example,
>>>>>> frequent_items_sketch<utf8_string> instead of
>>>>>> frequent_items_sketch<std::string> if validation is necessary.
>>>>>>
>>>>>> On Sun, Feb 15, 2026 at 8:38 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>
>>>>>>> Thanks for the feedback. I agree that for container sketches that
>>>>>>> retain and serialize strings, we should validate that string payloads
>>>>>>> are valid UTF-8 sequences to preserve cross-language portability.
>>>>>>>
>>>>>>> On *where* to validate in DS-CPP: validating at update() (ingest
>>>>>>> time) is attractive because it is fail-fast, but it also adds
>>>>>>> cost on the hot path. If the community is comfortable with that
>>>>>>> overhead for string-based container sketches, I’m happy to pursue the
>>>>>>> update()-time validation approach.
>>>>>>>
>>>>>>> If performance sensitivity is a concern, an alternative would be to
>>>>>>> always validate at (de)serialization boundaries (to guarantee artifact
>>>>>>> correctness), and optionally provide a “fail-fast” mode that enables
>>>>>>> validation at update() as well.
>>>>>>>
>>>>>>> For DS-Go, we can follow the same policy. Go’s situation is a bit
>>>>>>> simpler in implementation because it provides UTF-8 validation in the
>>>>>>> standard library (unicode/utf8), so we wouldn’t need an external
>>>>>>> dependency for the validator.
>>>>>>>
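As an aside on the alternative described just above (always validate at the serialization boundary, with an optional fail-fast mode at update()), the two behaviors could be sketched roughly as follows. Everything here is hypothetical and for illustration only: toy_string_container, utf8_check, write_u32 and toy_container_demo are made-up names, the container simply keeps every item, and the byte layout (native-endian 32-bit count and lengths) is a simplification, not the DataSketches serialization format.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <ostream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical scalar UTF-8 check following RFC 3629.
inline bool is_valid_utf8(const std::string& s) {
  const auto* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  for (std::size_t i = 0; i < n;) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }
    std::size_t len; std::uint32_t cp, min_cp;
    if ((b0 & 0xE0) == 0xC0)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if ((b0 & 0xF0) == 0xE0) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if ((b0 & 0xF8) == 0xF0) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return false;
    if (n - i < len) return false;
    for (std::size_t k = 1; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (p[i + k] & 0x3F);
    }
    if (cp < min_cp || (cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) return false;
    i += len;
  }
  return true;
}

// Where validation happens: only when serializing, or also at update().
enum class utf8_check { boundary_only, fail_fast };

class toy_string_container {
public:
  explicit toy_string_container(utf8_check mode = utf8_check::boundary_only)
    : mode_(mode) {}

  void update(const std::string& item) {
    // Fail-fast mode pays the validation cost on the hot path.
    if (mode_ == utf8_check::fail_fast && !is_valid_utf8(item)) {
      throw std::invalid_argument("update: item is not valid UTF-8");
    }
    items_.push_back(item);
  }

  void serialize(std::ostream& os) const {
    // Boundary check always runs, and runs before any byte is written.
    for (const auto& item : items_) {
      if (!is_valid_utf8(item)) {
        throw std::runtime_error("serialize: retained item is not valid UTF-8");
      }
    }
    write_u32(os, static_cast<std::uint32_t>(items_.size()));
    for (const auto& item : items_) {
      write_u32(os, static_cast<std::uint32_t>(item.size()));
      os.write(item.data(), static_cast<std::streamsize>(item.size()));
    }
  }

private:
  static void write_u32(std::ostream& os, std::uint32_t v) {
    os.write(reinterpret_cast<const char*>(&v), sizeof(v));  // native endianness
  }
  utf8_check mode_;
  std::vector<std::string> items_;
};

// Usage check: the same bad input fails late in one mode and early in the other.
inline void toy_container_demo() {
  toy_string_container lazy;  // boundary_only
  lazy.update("\xFF");        // accepted at ingest
  std::ostringstream os;
  bool failed_late = false;
  try { lazy.serialize(os); } catch (const std::runtime_error&) { failed_late = true; }
  assert(failed_late);

  toy_string_container strict(utf8_check::fail_fast);
  bool failed_early = false;
  try { strict.update("\xFF"); } catch (const std::invalid_argument&) { failed_early = true; }
  assert(failed_early);
}
```

The trade-off is the one already discussed in the thread: boundary-only validation keeps update() cheap but may report a bad item long after it was ingested, while fail-fast reports it immediately at a per-update cost.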
>>>>>>> On Mon, Feb 16, 2026 at 6:29 AM Lee Rhodes <[email protected]> wrote:
>>>>>>>
>>>>>>>> This issue, raised by Hyeonho Kim, relates to sketches that allow a
>>>>>>>> user to update the sketch with a string and that also retain within
>>>>>>>> the sketch a sample of the input strings seen. When serialized, there
>>>>>>>> is an implicit assumption that another user, possibly in a different
>>>>>>>> language, can successfully deserialize those sketch images. These
>>>>>>>> sketches include KLL, REQ, Classic Quantiles, Sampling,
>>>>>>>> FrequentItems, and Tuple. We informally call these "container"
>>>>>>>> sketches, because they contain actual samples from the input stream.
>>>>>>>> HLL, Theta, CPC, BloomFilter, etc., are not container sketches.
>>>>>>>>
>>>>>>>> In the DS-Java library, all container sketches that allow strings
>>>>>>>> always use UTF_8. So the sketch images produced will contain proper
>>>>>>>> UTF_8 sequences.
>>>>>>>>
>>>>>>>> In the DS-CPP library, all the various data types are abstracted
>>>>>>>> via templates. The serialization operation is declared similar to
>>>>>>>>
>>>>>>>>     sketch<T>::serialize(std::ostream& os, const SerDe& sd)
>>>>>>>>
>>>>>>>> where T is the item type, os is the output stream, and sd is the
>>>>>>>> SerDe that performs the conversion to bytes.
>>>>>>>>
>>>>>>>> If the user wants to use an item of type string, T would
>>>>>>>> typically be of type std::string, which is just a blob of bytes
>>>>>>>> with no requirement that it be UTF_8.
>>>>>>>>
>>>>>>>> So far, we have trusted users of the library to know that, if they
>>>>>>>> update one of these container classes with a type T, the
>>>>>>>> downstream user can successfully decode it. But this could be
>>>>>>>> catastrophic: a downstream user of a sketch image could be separated
>>>>>>>> from the creation of the sketch image by years and be using a
>>>>>>>> different language.
>>>>>>>>
>>>>>>>> One of the big advantages of our DataSketches project is that our
>>>>>>>> serialization images should be language and platform independent,
>>>>>>>> allowing cross-language and cross-platform interchange of sketches.
>>>>>>>>
>>>>>>>> Hyeonho Kim's recommendation makes sense: For serialized sketch
>>>>>>>> images that contain strings, those strings must be UTF_8.
>>>>>>>>
>>>>>>>> So how do we implement that? My thoughts are as follows:
>>>>>>>>
>>>>>>>> 1. We should document now, on the website and in appropriate
>>>>>>>>    places in the library, the potential danger of not using UTF_8
>>>>>>>>    strings. (At least until we have a more robust solution.)
>>>>>>>> 2. I think implementing validation checks on UTF_8 strings at
>>>>>>>>    the SerDe boundaries may be too late. A user could have processed
>>>>>>>>    a large stream of data only to discover a failure at
>>>>>>>>    serialization time, which could be much later in time. The other
>>>>>>>>    possibility would be to validate the strings at the input into
>>>>>>>>    the sketch, typically in the update() method.
>>>>>>>> 3. For C++, there are 3rd party libraries that specialize in
>>>>>>>>    UTF_8 validation, including ICU
>>>>>>>>    <https://github.com/unicode-org/icu>, UTF8-CPP
>>>>>>>>    <https://github.com/nemtrif/utfcpp>, and simdjson
>>>>>>>>    <https://lemire.me/blog/2020/10/20/ridiculously-fast-unicode-utf-8-validation/>.
>>>>>>>>    (These have standard licensing.) From what I've read, UTF-8
>>>>>>>>    validation, if done correctly, can be done very fast, with only
>>>>>>>>    a small section of code.
>>>>>>>> 4. I am not sure what the solutions are for Rust or Go.
>>>>>>>>
>>>>>>>> I welcome your feedback.
>>>>>>>>
>>>>>>>> On Sat, Feb 14, 2026 at 1:47 AM tison <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> This PR [1] of datasketches-rust demonstrates how the Rust impl
>>>>>>>>> deserializes String values.
>>>>>>>>>
>>>>>>>>> [1] https://github.com/apache/datasketches-rust/pull/82
>>>>>>>>>
>>>>>>>>> If it's std::string::String, then it must be of UTF-8 encoding.
>>>>>>>>> And we check the encoding on deserialization.
>>>>>>>>>
>>>>>>>>> However, the Rust ecosystem also supports "strings" that do not
>>>>>>>>> use UTF-8, such as BStr.
>>>>>>>>>
>>>>>>>>> So, my opinions are:
>>>>>>>>>
>>>>>>>>> 1. It's good to assume serialized string data to be valid UTF-8.
>>>>>>>>> 2. Even if it isn't, for datasketches-rust, users should be able
>>>>>>>>>    to choose a proper type to deserialize the bytes into, one that
>>>>>>>>>    doesn't require UTF-8 encoding.
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> tison.
>>>>>>>>>
>>>>>>>>> On Sat, Feb 14, 2026 at 5:24 PM Hyeonho Kim <[email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi all,
>>>>>>>>>>
>>>>>>>>>> While working on UTF-8 validation for the AoS tuple sketch in C++
>>>>>>>>>> (ref: https://github.com/apache/datasketches-cpp/pull/476),
>>>>>>>>>> a broader design question came up that may affect multiple sketches.
>>>>>>>>>>
>>>>>>>>>> Based on my current understanding:
>>>>>>>>>>
>>>>>>>>>> - In datasketches-java, string serialization already produces
>>>>>>>>>>   valid UTF-8 bytes via getBytes(StandardCharsets.UTF_8). So
>>>>>>>>>>   Java-generated artifacts already assume valid UTF-8 string
>>>>>>>>>>   encoding.
>>>>>>>>>> - Rust and Python string types represent Unicode text and can be
>>>>>>>>>>   encoded to UTF-8. Please correct me if I am mistaken. (I don't
>>>>>>>>>>   know Rust and Python well.)
>>>>>>>>>> - In Go, string is a byte sequence and may contain invalid UTF-8
>>>>>>>>>>   unless explicitly validated. So during serialization, it may
>>>>>>>>>>   produce invalid UTF-8 sequences.
>>>>>>>>>> - In C++, std::string is also a byte container and does not
>>>>>>>>>>   enforce UTF-8 validity. So during serialization, it may produce
>>>>>>>>>>   invalid UTF-8 sequences.
>>>>>>>>>>
>>>>>>>>>> If I am mistaken on any of these points, I would appreciate
>>>>>>>>>> corrections.
>>>>>>>>>>
>>>>>>>>>> If we want to maintain cross-language portability for serialized
>>>>>>>>>> artifacts, one possible approach would be to ensure that any
>>>>>>>>>> serialized string data is valid UTF-8. This could potentially
>>>>>>>>>> apply to any sketches that serialize or deserialize string data.
>>>>>>>>>>
>>>>>>>>>> There seem to be several possible approaches:
>>>>>>>>>> - Validate UTF-8 at serialization boundaries
>>>>>>>>>> - Document that input strings must be valid UTF-8 and rely on
>>>>>>>>>>   caller discipline
>>>>>>>>>>
>>>>>>>>>> At this point I am not proposing a specific solution. I would
>>>>>>>>>> like to hear opinions from the community on this question: do we
>>>>>>>>>> want to require serialized string data to be valid UTF-8 for
>>>>>>>>>> cross-language portability?
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>>
>>>>>>>>>> Hyeonho
