leerho commented on issue #7187: Improve topN algorithm URL: https://github.com/apache/incubator-druid/issues/7187#issuecomment-469439146 ### Objective - Replace the current internal heuristic algorithm for "TopN" with the DataSketches [Frequent Items Sketch (FIS)](https://datasketches.github.io/docs/FrequentItems/FrequentItemsOverview.html). ### Motivation - **Error vs. Size**: The FIS has user configureable parameter that allows tradeoff of the sketch size and its resulting accuracy (or what we call its error threshold, or epsilon). - **Error Clarity**: The FIS has well-defined error bounds that can be applied to each item in the result set. - **Query Flexibility**: At query time the user can choose two different query modes NO_FALSE_POSITIVES (more strict) or NO_FALSE_NEGATIVES (less strict). The FIS will not return items below the noise floor. So you know that the items it does return are, in fact, significant. - **Streaming**: The FIS is a streaming algorithm, just pass the input data into the sketch once. No need to presort the input stream. This will save processing cycles and time. - **Mergeable**: The FIS is mergeable. Each node would sketch its own data independently, then the sketches from each node are passed to a central "broker" node where they are merged into a final sketch, which can then be queried. This could reduce the amount of data passing between the nodes and the broker, but it will depend on how the sketch is configured. There is no loss of accuracy due to the merging process. - **Data Independent**: The FIS error properties are independent of how the input data is partitioned into separate streams (the nodes) and how the values and their weights are distributed within the streams themselves. The FIS can be somewhat input data order sensitive, in that the weights and order of the items in the result set may vary slightly from run to run, but they will always be within the error specification of the chosen sketch configuration. ### Making it Happen! We on the sketches team are not knowledgeable about Druid internals. We certainly can advise and consult on the proper use of the sketches, but it must be someone on the Druid team to integrate the sketch library into Druid.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
