proost commented on PR #476: URL: https://github.com/apache/datasketches-cpp/pull/476#issuecomment-3888349233
@leerho The reason I introduced UTF-8 validation is to preserve cross-language portability for the AoS tuple sketch. On the Java side, strings are Unicode and serialization produces UTF-8 bytes. If the C++ implementation accepts arbitrary `std::string` bytes and serializes them without checks, we can end up writing invalid UTF-8 into the sketch stream, which could fail to deserialize or behave inconsistently across languages.

I agree that bringing in a third-party UTF-8 library (and its non-standard license header) is a significant cost: integration, maintenance, debugging, and licensing updates.

I see three possible approaches:

1. Add a small, self-contained UTF-8 well-formedness validation routine in our codebase (C++11-compatible). This avoids external dependencies and LICENSE updates, with only a small maintenance burden.
2. Document the API contract clearly: input strings are expected to be valid UTF-8, without performing validation.
3. [Use ICU](https://github.com/unicode-org/icu). ICU works correctly, but it introduces a large dependency, additional operational cost, and longer compile times.

Based on your feedback, I propose we go with option (1). If you prefer a different direction, I can adjust.
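
As a rough illustration of option (1), a self-contained well-formedness check could look something like the sketch below. The name `is_well_formed_utf8` and its signature are hypothetical and not taken from the PR. The byte ranges follow the well-formed UTF-8 table in the Unicode Standard, so overlong encodings, surrogates, and code points above U+10FFFF are rejected. It uses only C++11 and the standard library.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the name and signature are illustrative only.
// Returns true if `s` is well-formed UTF-8: no overlong forms, no
// surrogate code points, and nothing above U+10FFFF.
inline bool is_well_formed_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 <= 0x7F) {  // ASCII, single byte
      ++i;
      continue;
    }
    std::size_t len = 0;      // total bytes in this sequence
    unsigned char lo = 0x80;  // allowed range for the second byte
    unsigned char hi = 0xBF;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;       // rejects overlong 3-byte forms
      else if (b0 == 0xED) hi = 0x9F;  // rejects surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;       // rejects overlong 4-byte forms
      else if (b0 == 0xF4) hi = 0x8F;  // rejects code points above U+10FFFF
    } else {
      return false;  // 0x80..0xC1 and 0xF5..0xFF never start a sequence
    }
    if (n - i < len) return false;  // sequence truncated at end of input
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      // remaining bytes must be plain continuation bytes
      if (p[i + k] < 0x80 || p[i + k] > 0xBF) return false;
    }
    i += len;
  }
  return true;
}
```

A serializer could call a check like this before writing each string and reject the input on failure. Whether to reject at that point or only document the contract is the trade-off between options (1) and (2).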
