[GitHub] [incubator-druid] peferron edited a comment on issue #7187: Improve topN algorithm

GitBox Sat, 16 Mar 2019 00:41:23 -0700

peferron edited a comment on issue #7187: Improve topN algorithm
URL: 
https://github.com/apache/incubator-druid/issues/7187#issuecomment-473508823
 
 
   @leerho 
   
   > The **first belief** that:
   > 
   > > The current topN algorithm **can be used for any query** shaped like 
SELECT x, AGG(y) FROM table ORDER BY AGG(y) DESC LIMIT N when 
useApproximateTopN is enabled.
   > 
   > This belief is a total fiction and this claim will fail miserably.
   
   You can use `topN` with any aggregation and get back results. This is a 
fact, not a belief. The misguided belief would be that these results are always 
meaningful, but I don't recall anybody here saying that; it has been repeated 
again and again that `topN` is data-sensitive.
   
   > The **second belief**:
   > 
   > This claim of the ability to plug-in HLL sketches as the aggregator will 
also fail miserably.
   
   Same answer as above. You can absolutely plug a HLL sketch as a `topN` 
aggregation. Whether the results are meaningful or not depends on the data.
   
   > The **third** belief:
   > 
   > [...]
   > 
   > In fact the **only** "aggregation" operation that can possibly produce any 
meaningful results with the Druid TopN algorithm is a sum of positive weights.
   
   Since this assertion is worded very strongly, it can be disproven by finding 
a counter-example where meaningful results are produced using `topN` in 
conjunction with, say, a HLL sketch.
   
   That's easy to do for almost anyone with access to a Druid cluster. That 
happens to be my case, so I just ran queries against about a billion rows of 
real-world data. The queries are `SELECT x, COUNT(DISTINCT y) GROUP BY 1 ORDER 
BY 2 DESC LIMIT 5`.
   
   In the first test, `x` is a dimension with a cardinality of 30k, and `y` is 
another dimension with high cardinality. `topN` returns the same results as 
`groupBy` (which is the exact counterpart of `topN`), but 30% faster.
   
   In the second test, `x` is a dimension with a cardinality of 2M. `groupBy` 
cannot return results because it exceeds resource limits. After reducing the 
query interval enough to bring the cardinality of `x` down to levels that 
`groupBy` can handle, `topN` again returns the same results as `groupBy`, but 
75% faster.
   
   So, here we are—meaningful results produced using `topN` with HLL, and with 
significant performance gains.
   
   These are the first queries I tried, but I'm sure I could find some where 
`topN` fails miserably. That's not the point, though—the point is that `topN` 
works well in some cases, as demonstrated. You seem to think that Druid users, 
including participants in this discussion, are incapable of identifying these 
cases, and are using `topN` only because they don't understand how badly it can 
fail. That's certainly your prerogative, and you may be right for the possibly 
large number of Druid users who don't dig into how the system works or read the 
documentation. As a somewhat power user though, I'm happy to have a tool like 
`topN` at my disposal, even if it's up to me to use it responsibly.
   
   That being said, I certainly hope that Druid goes towards a direction where 
error bounds are systematically provided, as discussed in #7160. So thank you 
and @AlexanderSaydakov for starting this movement by integrating the DS library.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-druid] peferron edited a comment on issue #7187: Improve topN algorithm

Reply via email to