leerho commented on pull request #11201:
URL: https://github.com/apache/druid/pull/11201#issuecomment-834830853


   @gianm 
   I want to clarify a comment made above.
   
   > HLL sketches used UTF-16LE encoding when hashing strings. 
   
   This is not correct, at least for the HLL in datasketches-java (I'm not sure 
what the Druid adaptor does).  Strings are encoded as UTF-8 and have been 
for as long as I can remember.  If you wish to use UTF-16, you just convert 
your string to char[] and the HLL sketch will accept that as well.  The sketch 
really doesn't care what the string encoding is; it treats the input as either 
a byte[] or a char[].  The UTF-8 encoding was specified in the string update 
method to help users ensure consistency (in case the string happened to be 
encoded in something else).  Nonetheless, whatever you decide, you will 
**always** need to stick with your choice.  Otherwise, you will destroy the 
unique identity of whatever you are feeding the sketch.  As a result, counts, 
merging, etc. will be meaningless!
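
To illustrate why the encoding choice has to stay fixed, here is a minimal sketch in plain Java. It does not use datasketches-java: the class `EncodingIdentityDemo` and the method `distinctItems` are hypothetical names, and an exact set of byte sequences stands in for the sketch. The assumption is only that a sketch identifies an item by the raw bytes (or chars) it is given, as described above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class EncodingIdentityDemo {

    // Hypothetical stand-in for a sketch: counts the distinct byte sequences
    // produced by encoding values.get(i) with charsets.get(i).
    static int distinctItems(List<String> values, List<Charset> charsets) {
        Set<ByteBuffer> seen = new HashSet<>();
        for (int i = 0; i < values.size(); i++) {
            byte[] bytes = values.get(i).getBytes(charsets.get(i));
            // ByteBuffer compares by content, unlike a raw byte[].
            seen.add(ByteBuffer.wrap(bytes));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<String> values = Arrays.asList("druid", "druid");

        // Same string, same encoding both times: one distinct item.
        int consistent = distinctItems(values,
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.UTF_8));
        System.out.println(consistent); // prints 1

        // Same string, mixed encodings: the byte sequences differ,
        // so the same logical value is counted twice.
        int mixed = distinctItems(values,
                Arrays.asList(StandardCharsets.UTF_8, StandardCharsets.UTF_16LE));
        System.out.println(mixed); // prints 2
    }
}
```

The same effect would apply when merging two sketches that were built with different encodings: identical strings would no longer be recognized as the same item.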
   
   I have some comments about [PR 
353](https://github.com/apache/datasketches-java/pull/353) but I want to make 
these in the actual PR.
   
   Lee.
   
   
   
   
   
   

