leerho commented on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-472185432
 
 
   Perhaps it is "well-known" amongst the Druid developers, but I think the 
Druid team is taking a serious risk with Druid's reputation.  Druid's customers 
may not fully understand that the current TopN functionality may return 
garbage, even if it is discussed in the documentation (who reads the manual, 
anyway  :)  ).  
   
   Quoting @peferron above:
   > Going back to your song example, could FIS support getting the top song 
titles by number of unique listeners over the past week? The **current topN can 
do that** using HLL sketches as metric (weight). That's one example of the 
usefulness of accepting any Druid aggregation or post-aggregation as a metric.
   
   It is not clear from his statement whether he realizes that "accepting any 
Druid aggregation or post-aggregation" may not work as intended and, in fact, 
may produce meaningless or misleading results.  
   
   This kind of data sensitivity is not a good thing, especially given that 
there is no easy way for a customer to determine whether the data he/she has 
processed is producing trustworthy results or not.
   
   I would consider the current TopN functionality to be broken and unreliable, 
period.   
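   
   To make the concern concrete, here is a minimal sketch of the failure mode, assuming each node keeps only its local top `k` entries before the broker sums and ranks them. The data and the helper names (`local_top_k`, `merged_top_n`, `exact_top_n`) are hypothetical and are not Druid code; the point is only that an item that ranks second everywhere can be first overall and still vanish.

```python
from collections import Counter

def local_top_k(counts, k):
    """Keep only the k heaviest (item, weight) pairs of one node's counts."""
    return dict(Counter(counts).most_common(k))

def merged_top_n(per_node_counts, n, k):
    """Truncate every node to its local top k, then sum and rank (broker step)."""
    merged = Counter()
    for counts in per_node_counts:
        merged.update(local_top_k(counts, k))
    return merged.most_common(n)

def exact_top_n(per_node_counts, n):
    """Sum the full counts from every node, then rank."""
    merged = Counter()
    for counts in per_node_counts:
        merged.update(counts)
    return merged.most_common(n)

# Hypothetical data: "c" is second on each node but first overall.
node_1 = {"a": 11, "c": 9, "b": 1}
node_2 = {"b": 10, "c": 9, "a": 1}
nodes = [node_1, node_2]

exact = exact_top_n(nodes, 1)        # [("c", 18)]
approx = merged_top_n(nodes, 1, 1)   # [("a", 11)]
print("exact: ", exact)
print("approx:", approx)
```

   In this toy case the truncated result names the wrong item and also understates its weight (11 instead of 12), and nothing in the output signals that anything was lost.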
   
   Here are two alternatives (there may be others):
   - In any "TopN" operation do not allow any aggregation steps after the data 
set has been "limited".  This has to be true both on the nodes and on the 
broker.
   - Adopt the Frequent Items sketch for "TopN" operations.  Accept that the 
ordering weight metric is positive integers (longs) and that the aggregation 
operation is `add()`.  (If you feel you really need real-valued weights, then 
we can develop for you an FIS based on doubles.  That is not a big deal.)
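   
   As a rough illustration of the second alternative, the following is a simplified Misra-Gries-style summary with positive integer weights, an `add()` update, and a `merge()` step. It is a stand-in written for this comment, not the DataSketches Frequent Items sketch implementation; the class name `FrequentItemsSummary` and its methods are hypothetical. What it is meant to show is that every estimate comes with a lower and an upper bound, so the error is visible rather than silent.

```python
class FrequentItemsSummary:
    """Simplified weighted Misra-Gries summary (illustrative stand-in only)."""

    def __init__(self, max_counters):
        if max_counters < 1:
            raise ValueError("max_counters must be at least 1")
        self.max_counters = max_counters
        self.counters = {}   # item -> lower bound on its true weight
        self.offset = 0      # total weight subtracted so far (max error)

    def add(self, item, weight=1):
        """Add a positive integer weight for one item."""
        if weight <= 0:
            raise ValueError("weight must be a positive integer")
        self.counters[item] = self.counters.get(item, 0) + weight
        self._shrink()

    def merge(self, other):
        """Fold another summary into this one, e.g. at the broker."""
        for item, weight in other.counters.items():
            self.counters[item] = self.counters.get(item, 0) + weight
        self.offset += other.offset
        self._shrink()

    def _shrink(self):
        """If over capacity, subtract the (max_counters+1)-th largest weight."""
        if len(self.counters) <= self.max_counters:
            return
        cut = sorted(self.counters.values(), reverse=True)[self.max_counters]
        self.counters = {i: w - cut for i, w in self.counters.items() if w > cut}
        self.offset += cut

    def bounds(self, item):
        """Return (lower, upper) bounds on the item's true total weight."""
        lower = self.counters.get(item, 0)
        return lower, lower + self.offset

    def top(self, n):
        """Return up to n items as (item, lower, upper), heaviest first."""
        ranked = sorted(self.counters.items(), key=lambda kv: kv[1], reverse=True)
        return [(item, w, w + self.offset) for item, w in ranked[:n]]


# Hypothetical per-node data; true totals are c=18, a=12, b=11.
node_1 = FrequentItemsSummary(max_counters=2)
for item, weight in [("a", 11), ("c", 9), ("b", 1)]:
    node_1.add(item, weight)

node_2 = FrequentItemsSummary(max_counters=2)
for item, weight in [("b", 10), ("c", 9), ("a", 1)]:
    node_2.add(item, weight)

node_1.merge(node_2)
summary = node_1
print(summary.top(1))
print(summary.bounds("a"), summary.bounds("b"))
```

   With only two counters the bounds are deliberately wide, but each true total still lies inside its reported interval; a larger `max_counters` tightens them.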
   
   My point is that a better, more efficient, and free solution exists for the 
most common TopN type operations. Why not use it?  
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
