leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-469439146
 
 
   ### Objective
   - Replace the current internal heuristic algorithm for "TopN" with the 
DataSketches [Frequent Items Sketch 
(FIS)](https://datasketches.github.io/docs/FrequentItems/FrequentItemsOverview.html).
    
   ### Motivation
   - **Error vs. Size**: The FIS has user configureable parameter that allows 
tradeoff of the sketch size and its resulting accuracy (or what we call its 
error threshold, or epsilon).
   - **Error Clarity**: The FIS has well-defined error bounds that can be 
applied to each item in the result set.
   - **Query Flexibility**: At query time the user can choose two different 
query modes NO_FALSE_POSITIVES (more strict) or NO_FALSE_NEGATIVES (less 
strict).  The FIS will not return items below the noise floor. So you know that 
the items it does return are, in fact, significant.
   - **Streaming**: The FIS is a streaming algorithm, just pass the input data 
into the sketch once.  No need to presort the input stream.  This will save 
processing cycles and time.
   - **Mergeable**: The FIS is mergeable.  Each node would sketch its own data 
independently, then the sketches from each node are passed to a central 
"broker" node where they are merged into a final sketch, which can then be 
queried. This could reduce the amount of data passing between the nodes and the 
broker, but it will depend on how the sketch is configured.  There is no loss 
of accuracy due to the merging process.
   - **Data Independent**: The FIS error properties are independent of how the 
input data is partitioned into separate streams (the nodes) and how the values 
and their weights are distributed within the streams themselves.  The FIS can 
be somewhat input data order sensitive, in that the weights and order of the 
items in the result set may vary slightly from run to run, but they will always 
be within the error specification of the chosen sketch configuration.
   
   ### Making it Happen!
   We on the sketches team are not knowledgeable about Druid internals.  We 
certainly can advise and consult on the proper use of the sketches, but it must 
be someone on the Druid team to integrate the sketch library into Druid.
    
    
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to