proost commented on PR #476:
URL: https://github.com/apache/datasketches-cpp/pull/476#issuecomment-3888349233

   @leerho 
   The reason I introduced UTF-8 validation is to preserve cross-language 
portability for the AoS tuple sketch. On the Java side, strings are Unicode and 
serialization produces UTF-8 bytes; if the C++ implementation accepts arbitrary 
std::string bytes and serializes them without checks, we can end up writing 
invalid UTF-8 into the sketch stream, which could fail to deserialize or behave 
inconsistently across languages.
   
   I agree that bringing in a third-party UTF-8 library (and its non-standard 
license header) is a significant cost: integration, maintenance, debugging, and 
licensing updates.
   
   I see three possible approaches:
   
   1. Add a small, self-contained UTF-8 well-formedness validation routine in 
our codebase (C++11-compatible). This avoids external dependencies and LICENSE 
updates, with only a small maintenance burden.
   2. Document the API contract clearly: input strings are expected to be valid 
UTF-8, without performing validation.
   3. Use [ICU](https://github.com/unicode-org/icu). ICU validates correctly, 
but it introduces a large dependency, additional operational cost, and 
longer compile times.
   
   Based on your feedback, I propose we go with option (1). If you prefer a 
different direction, I can adjust.
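   
   To make option (1) concrete, here is a minimal sketch of what such a 
routine could look like. It is only an illustration, not the code in this PR: 
the name `is_valid_utf8` is hypothetical, and the byte ranges follow the 
well-formed UTF-8 table in the Unicode Standard (rejecting overlong encodings, 
surrogates, and code points above U+10FFFF).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper for illustration; not an existing datasketches-cpp API.
// Returns true if every byte sequence in s is well-formed UTF-8.
inline bool is_valid_utf8(const std::string& s) {
  const unsigned char* p = reinterpret_cast<const unsigned char*>(s.data());
  const std::size_t n = s.size();
  std::size_t i = 0;
  while (i < n) {
    const unsigned char b0 = p[i];
    if (b0 < 0x80) { ++i; continue; }    // single-byte (ASCII)

    std::size_t len = 0;
    unsigned char lo = 0x80;             // allowed range for the second byte
    unsigned char hi = 0xBF;
    if (b0 >= 0xC2 && b0 <= 0xDF) {
      len = 2;
    } else if (b0 >= 0xE0 && b0 <= 0xEF) {
      len = 3;
      if (b0 == 0xE0) lo = 0xA0;         // reject overlong 3-byte forms
      if (b0 == 0xED) hi = 0x9F;         // reject surrogates U+D800..U+DFFF
    } else if (b0 >= 0xF0 && b0 <= 0xF4) {
      len = 4;
      if (b0 == 0xF0) lo = 0x90;         // reject overlong 4-byte forms
      if (b0 == 0xF4) hi = 0x8F;         // reject code points above U+10FFFF
    } else {
      return false;                      // 0x80..0xC1 and 0xF5..0xFF never lead
    }

    if (n - i < len) return false;       // truncated sequence
    if (p[i + 1] < lo || p[i + 1] > hi) return false;
    for (std::size_t k = 2; k < len; ++k) {
      if ((p[i + k] & 0xC0) != 0x80) return false;  // must be 10xxxxxx
    }
    i += len;
  }
  return true;
}
```

   A check like this would run once per string before serialization. Whether 
a failure should throw or be reported some other way is a separate decision 
that this sketch does not address.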
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

