gianm commented on issue #6814: [Discuss] Replacing hyperUnique as 'default' distinct count sketch
URL: https://github.com/apache/incubator-druid/issues/6814#issuecomment-452855302

> The Druid-HLL sketch delegates the responsibility of hashing the input items to the user.

This isn't true in practice. The druid-hll library lets callers use any hash function, but Druid doesn't expose that to end users. It always uses `Hashing.murmur3_128()` to avoid the problems you mentioned.

> The design flaw that is causing the poor error on merging is fundamental. The Druid-HLL sketch is not keeping sufficient information in its stored image to decode it accurately.

Out of curiosity what is the missing information? I didn't see details on that in the link you provided.

> I see no benefit to attempting to "fix" the Druid-HLL sketch as it would require a complete redesign from scratch.

The benefit would be giving users a migration path from a possibly large number of already-stored sketches, other than "sorry, you have to reindex all of your historical data". Which, for some users, may be inconvenient or impossible. Better would be "you can migrate to a new format for newly ingested data, and it will give you better behavior for that new data, degrading to the old behavior if you query a time range that covers both the old and new formats." There is value in that if it is possible. (I don't feel like I understand the details well enough to say if it is possible.)

> But we have already designed a far superior HLL sketch as part of the DataSketches library. I strongly suggest we document and communicate these issues to the user community, and suggest they migrate to the DataSketches library.

It does sound like you have designed a better one and thank you for that!
