drcrallen edited a comment on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-461595836

@leerho it is doing a version of sampling (but NOT event sampling) prior to sending to the sketch. Specifically, the sketch is built against ALL events in a specific subset of the data.

Basically: pick some quantity of IDs. Assume that the selected IDs are a representative sample of the total population. Log all events from the selected IDs. Sketches against those IDs should then be fine for that subset, with the knowledge that you can ONLY account for things happening in the sample population (e.g. no, or very limited, network effect analysis). This tends to work pretty well for quick insights on big effects.

The problem comes in when someone uses a simple `hash(id) % some_number` (or something that becomes effectively that) to determine whether the ID should be part of the sample set AND an HLL sketch uses the same hash function with the same seed against `id`. An example of this is included in the unit tests.
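
To illustrate the interaction described above, here is a minimal Python sketch. It is a toy model, not Druid's HLL collector and not the unit test from the PR: the `hash64` helper, the `user-N` IDs, the bucket count, and the choice to take the bucket index from the low bits of the hash are all assumptions made for illustration. It only counts how many sketch buckets the sampled IDs can reach, once when the sampler and the sketch share a hash and seed, and once when the sampler uses a different seed.

```python
import hashlib

NUM_BUCKETS = 1024  # toy sketch with 2**10 registers (assumption)
SAMPLE_MOD = 16     # keep an ID when hash(id) % 16 == 0 (assumption)


def hash64(value: str, seed: int = 0) -> int:
    """Hypothetical seeded 64-bit hash; a different seed acts as a different hash."""
    digest = hashlib.blake2b(
        value.encode("utf-8"),
        digest_size=8,
        salt=seed.to_bytes(8, "little"),
    ).digest()
    return int.from_bytes(digest, "little")


def buckets_touched(ids, sample_seed: int, sketch_seed: int = 0) -> int:
    """Count distinct toy-sketch buckets reached by the sampled IDs."""
    touched = set()
    for id_ in ids:
        # Sampling step: hash(id) % some_number decides membership.
        if hash64(id_, sample_seed) % SAMPLE_MOD != 0:
            continue
        # Sketch step: the bucket index comes from the low bits of the
        # sketch hash (an assumption of this toy model).
        touched.add(hash64(id_, sketch_seed) % NUM_BUCKETS)
    return len(touched)


ids = [f"user-{i}" for i in range(200_000)]

# Same hash and seed for sampling and sketching: every sampled hash is a
# multiple of 16, so only buckets whose index is a multiple of 16 are reachable.
same_seed = buckets_touched(ids, sample_seed=0)

# Independent seed for sampling: sampled IDs spread over all buckets.
different_seed = buckets_touched(ids, sample_seed=1)

print("buckets touched, shared hash:     ", same_seed)
print("buckets touched, independent hash:", different_seed)
```

In this toy model the shared-hash case can never touch more than `NUM_BUCKETS // SAMPLE_MOD` buckets, so an estimator that assumes uniformly filled registers would be working from a badly skewed register array. Using a different seed (or a different hash function) for the sampling decision removes the correlation here; whether that is the right fix for a given pipeline depends on which hash bits the real sketch implementation consumes.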
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
