hfukada commented on PR #15049:
URL: https://github.com/apache/druid/pull/15049#issuecomment-1739319642

   > Nice. I had taken a cursory look at this paper a while back. But it wasn't 
clear to me if DDSketch has any advantage over 
https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html. We 
already use the DataSketches library in Druid. Have you looked at this library?
   
   We have been trying to make QuantilesSketches work, specifically the 
`quantilesDoublesSketch`, for over a year. The QuantilesSketches provided out of 
the box perform well for ranks in the middle of the distribution, as shown in 
this chart:
   
   
![](https://datasketches.apache.org/docs/img/quantiles/DSQsketchK256_StreamA_CDF.png)
   This is because values around p25-p75 are far more forgiving of rank error 
(the error guarantee that QuantilesSketches provide): the p50 value does not 
differ much from the p51 value.
   
   But as the DDSketch paper points out, sketches that operate on rank error 
are much less accurate when the main goal is capturing the p9x values: in a 
long-tail distribution, p98 can differ wildly from p99. This is where DDSketch 
comes in. It provides an error guarantee relative to the actual value, uniform 
across all quantiles. We are happy with its performance and accuracy on a 
long-tail distribution that spans from a few milliseconds to 10 minutes.
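To make the rank-error argument concrete, here is a minimal sketch. It assumes an idealized Pareto distribution (shape 1, minimum 1) as a stand-in for a long-tail latency distribution, and a hypothetical 1% rank error. Neither number comes from the PR or from DataSketches; the function names are illustrative only.

```python
def pareto_quantile(q: float) -> float:
    """Exact quantile of a Pareto(shape=1, min=1) distribution: x = 1 / (1 - q)."""
    return 1.0 / (1.0 - q)


def value_error_from_rank_error(q: float, rank_error: float) -> float:
    """Relative value error when the answer comes from rank q - rank_error instead of q."""
    true_value = pareto_quantile(q)
    returned_value = pareto_quantile(q - rank_error)
    return abs(returned_value - true_value) / true_value


# Same assumed 1% rank error, very different value error.
mid_error = value_error_from_rank_error(0.50, 0.01)   # p49 returned instead of p50
tail_error = value_error_from_rank_error(0.99, 0.01)  # p98 returned instead of p99

print(f"value error at p50: {mid_error:.1%}")   # roughly 2%
print(f"value error at p99: {tail_error:.1%}")  # roughly 50%
```

Under these assumptions the same rank error costs about 2% in value at the median but about 50% at p99, which is the effect described above.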
   
   As an example, it ultimately does not matter if the true p99 is 550 seconds 
but the calculated p99 is 555 seconds, a relative error of under 1%.
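The relative-error guarantee comes from the logarithmic bucketing described in the DDSketch paper. The following is a minimal sketch of that mapping only, not the `sketches-java` implementation. The accuracy `ALPHA = 0.01` is an assumed value for illustration (the comment does not state the extension's setting), and the function names are hypothetical.

```python
import math

ALPHA = 0.01                       # assumed relative accuracy (1%)
GAMMA = (1 + ALPHA) / (1 - ALPHA)  # ratio between consecutive bucket boundaries


def bucket_index(x: float) -> int:
    """Index i of the bucket (GAMMA**(i-1), GAMMA**i] that contains x > 0."""
    return math.ceil(math.log(x) / math.log(GAMMA))


def bucket_value(i: int) -> float:
    """Representative value of bucket i, chosen so both edges are within ALPHA of it."""
    return 2 * GAMMA ** i / (GAMMA + 1)


def estimate(x: float) -> float:
    """Value a sketch with this mapping would report for an input x."""
    return bucket_value(bucket_index(x))


for latency_seconds in (0.005, 1.0, 550.0):
    est = estimate(latency_seconds)
    rel = abs(est - latency_seconds) / latency_seconds
    print(f"{latency_seconds:>8} s -> {est:.4f} s (relative error {rel:.2%})")
```

Because the bucket width grows in proportion to the value, a 5 ms latency and a 550 s latency are both reported within the same 1% bound under this assumed setting.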
   
   A big pain point with DataSketches is understanding and tuning K. Tuning K so 
that queries return a stable p9x is painful. Because QuantilesSketches results 
are non-deterministic, the same query over the same underlying dataset can 
return a different value every time it is fired. K can be raised to make the 
returned values more stable, but at the cost of higher memory consumption 
(often causing heap-to-disk events) and much slower queries.
   
   Furthermore, the `sketches-java` library that Datadog released offers 
different storage strategies to manage memory. These include unbounded storage 
and bounded storage that maintains the error guarantee at either the higher or 
the lower bins. They are described in the comments in this source file: 
https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/ddsketch/DDSketches.java
 
   
   This extension makes use of the `collapsingLowestDense` strategy to preserve 
error guarantees at the highest quantiles.
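The idea behind a collapsing-lowest store can be illustrated with a toy model. This is a simplified sketch, not the `sketches-java` code: the class name, the dictionary-based store, and the parameters (`alpha=0.01`, `max_bins=50`) are all assumptions chosen for illustration. When the bin budget is exceeded, the lowest bin is folded into its neighbor, so only the low quantiles lose accuracy.

```python
import math
from collections import Counter


class ToyCollapsingLowestSketch:
    """Toy relative-error sketch that merges its lowest bins when over budget."""

    def __init__(self, alpha: float, max_bins: int) -> None:
        self.gamma = (1 + alpha) / (1 - alpha)
        self.log_gamma = math.log(self.gamma)
        self.max_bins = max_bins
        self.bins = Counter()  # bucket index -> count
        self.count = 0

    def add(self, x: float) -> None:
        """Record a positive value, collapsing the lowest bins if needed."""
        self.bins[math.ceil(math.log(x) / self.log_gamma)] += 1
        self.count += 1
        while len(self.bins) > self.max_bins:
            lowest, second_lowest = sorted(self.bins)[:2]
            self.bins[second_lowest] += self.bins.pop(lowest)

    def quantile(self, q: float) -> float:
        """Representative value of the bin holding the rank q * (count - 1)."""
        if self.count == 0:
            raise ValueError("sketch is empty")
        rank = q * (self.count - 1)
        seen = 0
        for index in sorted(self.bins):
            seen += self.bins[index]
            if seen > rank:
                return 2 * self.gamma ** index / (self.gamma + 1)
        raise AssertionError("unreachable for 0 <= q <= 1")


sketch = ToyCollapsingLowestSketch(alpha=0.01, max_bins=50)
for latency_ms in range(1, 10_001):  # synthetic data: 1 ms .. 10 s
    sketch.add(float(latency_ms))

print("bins used:", len(sketch.bins))
print("p99 estimate:", sketch.quantile(0.99))  # stays within 1% of 9900
print("p01 estimate:", sketch.quantile(0.01))  # far from the true value near 100
```

With this synthetic input the p99 estimate keeps its 1% bound while the p01 estimate is sacrificed, which is the trade-off that suits a workload focused on p9x values.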


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

