gianm commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' 
distinct count sketch
URL: 
https://github.com/apache/incubator-druid/issues/6814#issuecomment-452855302
 
 
   > The Druid-HLL sketch delegates the responsibility of hashing the input 
items to the user.
   
   This isn't true in practice. The druid-hll library lets callers use any hash 
function, but Druid doesn't expose that to end users. It always uses 
`Hashing.murmur3_128()` to avoid the problems you mentioned.
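   
   To illustrate the split between the library and the engine, here is a 
minimal sketch. It is not Druid's code: `RawSketch` and `EngineSketch` are 
hypothetical names, the "sketch" is an exact set rather than an HLL, and 
BLAKE2b stands in for murmur3_128 only because Python's standard library has 
no murmur3. The point is that the low-level API accepts pre-hashed input, 
while the only path exposed to users always applies one fixed hash.

```python
import hashlib


class RawSketch:
    """Hypothetical library-level sketch: accepts already-hashed items."""

    def __init__(self):
        # Exact set of hashes, standing in for the real HLL registers.
        self.hashes = set()

    def add_hashed(self, hashed: bytes) -> None:
        # The library trusts the caller to have used a good hash function.
        self.hashes.add(hashed)


class EngineSketch:
    """Hypothetical engine-side wrapper: users never choose the hash."""

    def __init__(self):
        self._raw = RawSketch()

    def add(self, value: str) -> None:
        # Assumption: a 128-bit BLAKE2b digest stands in for murmur3_128.
        digest = hashlib.blake2b(value.encode("utf-8"), digest_size=16).digest()
        self._raw.add_hashed(digest)

    def distinct_hashes(self) -> int:
        return len(self._raw.hashes)


sketch = EngineSketch()
for item in ["a", "b", "a"]:
    sketch.add(item)
print(sketch.distinct_hashes())
```

   Because end users only ever reach `add`, a poorly chosen hash function 
cannot get into the stored data.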
   
   > The design flaw that is causing the poor error on merging is fundamental. 
The Druid-HLL sketch is not keeping sufficient information in its stored image 
to decode it accurately.
   
   Out of curiosity, what is the missing information? I didn't see details on 
that in the link you provided.
   
   > I see no benefit to attempting to "fix" the Druid-HLL sketch as it would 
require a complete redesign from scratch.
   
   The benefit would be giving users a migration path from a possibly large 
number of already-stored sketches, other than "sorry, you have to reindex all 
of your historical data", which for some users may be inconvenient or 
impossible. A better message would be "you can migrate to a new format for 
newly ingested data, and it will give you better behavior for that new data, 
degrading to the old behavior if you query a time range that covers both the 
old and new formats." There is value in that if it is possible. (I don't 
understand the details well enough to say whether it is.)
   
   > But we have already designed a far superior HLL sketch as part of the 
DataSketches library. I strongly suggest we document and communicate these 
issues to the user community, and suggest they migrate to the DataSketches 
library.
   
   It does sound like you have designed a better one, and thank you for that!
