drcrallen edited a comment on issue #6865: Densify swapped hll buffer
URL: https://github.com/apache/incubator-druid/pull/6865#issuecomment-461595836
 
 
   @leerho it does a form of sampling (but NOT event sampling) before anything 
is sent to the sketch. Specifically, the sketch is built against ALL events in a 
specific subset of the data.
   
   Basically: pick some number of IDs. Assume that the selected IDs are a 
representative sample of the total population. Log all events from the selected 
IDs. Sketches against the IDs should then be fine for that subset, with 
the knowledge that you can ONLY account for things happening in the sample 
population (e.g. no, or only very limited, network effect analysis).
   
   This tends to work pretty well for quick insights on big effects. The 
problem comes in when someone uses a simple `hash(id) % some_number` (or 
something that effectively reduces to that) to decide whether the ID should be 
part of the sample set AND an HLL sketch uses the same hash function with the 
same seed against `id`. An example of this is included in the unit tests.
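
   As a rough illustration of the interaction (this is a toy model, not the 
unit test from the PR and not Druid's HLL code): assume the sketch picks its 
bucket from the low bits of the hash, and the sampler keeps IDs where 
`hash(id) % 16 == 0`. The names `seeded_hash`, `in_sample`, `buckets_touched`, 
the bucket count, and the use of SHA-256 as a stand-in hash are all assumptions 
made for the example.

```python
import hashlib

NUM_BUCKETS = 2048  # toy sketch size (power of two); an assumption
SAMPLE_MOD = 16     # sampler keeps IDs where hash(id) % 16 == 0; an assumption


def seeded_hash(value: str, seed: int) -> int:
    """64-bit seeded hash; SHA-256 is only a stand-in for a real sketch hash."""
    digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def in_sample(user_id: str, seed: int) -> bool:
    """Hash-based ID sampling: keep roughly 1 in SAMPLE_MOD IDs."""
    return seeded_hash(user_id, seed) % SAMPLE_MOD == 0


def buckets_touched(ids, seed: int) -> int:
    """Count distinct buckets hit when the bucket index is the hash's low bits."""
    return len({seeded_hash(i, seed) % NUM_BUCKETS for i in ids})


ids = [f"user-{n}" for n in range(200_000)]
sampled = [i for i in ids if in_sample(i, seed=0)]

# Same hash and seed as the sampler: every sampled hash is a multiple of 16,
# so only bucket indexes that are multiples of 16 can ever be hit
# (at most NUM_BUCKETS / SAMPLE_MOD = 128 buckets).
same_seed = buckets_touched(sampled, seed=0)

# Different seed for the sketch: the sampled IDs spread over nearly all buckets.
other_seed = buckets_touched(sampled, seed=1)

print(f"sampled IDs: {len(sampled)}")
print(f"buckets touched, same seed:  {same_seed} of {NUM_BUCKETS}")
print(f"buckets touched, other seed: {other_seed} of {NUM_BUCKETS}")
```

   In this toy setup the sampled IDs are a fine sample of the population, but 
the sketch only ever sees a thin slice of its hash space when it shares the 
hash and seed with the sampler, which is the kind of correlation described 
above. Which bits a real sketch uses, and how badly the estimate is skewed, 
depends on the implementation.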
