hfukada commented on PR #15049: URL: https://github.com/apache/druid/pull/15049#issuecomment-1739319642
> Nice. I had taken a cursory look at this paper a while back. But it wasn't clear to me if DDSketch has any advantage over https://datasketches.apache.org/docs/Quantiles/QuantilesSketchOverview.html. We already use data sketches library in druid. Have you looked at this library?

We have been trying to make QuantilesSketches, specifically the `quantilesDoublesSketch`, work for over a year. The QuantilesSketches provided out of the box perform well for ranks in the middle, as shown in this chart. This is because distributions around p25-p75 are far more forgiving of rank error (the error guarantee that QuantilesSketches provide): p50 values do not differ much from p51. But as the DDSketch paper points out, sketches that operate on rank error are less accurate when the main operation is capturing p9x values, since a p98 can vary wildly from a p99 in a long-tail distribution.

This is where DDSketch comes in: it provides an error guarantee relative to the actual value (uniform error). We are happy with its performance and accuracy on a long-tail distribution that spans from a few milliseconds to 10 minutes. As an example, it ultimately does not matter if the true p99 is 550 seconds but the calculated p99 is 555 seconds.

A big pain point with datasketches is tuning and understanding K. Tuning K so that queries calculate a stable p9x is painful. Because the values QuantilesSketches return are non-deterministic, the same query over the same underlying dataset can return a different value every time it is fired. K can be bumped upwards to make the values more stable, but at the cost of higher memory consumption (often causing heap-to-disk events) and much slower queries.

Furthermore, the `sketches-java` library that Datadog released has different storage strategies to manage memory. It includes strategies for unbounded storage and for storage that maintains error guarantees at the higher or lower bins, as described in the comments in this source: https://github.com/DataDog/sketches-java/blob/master/src/main/java/com/datadoghq/sketch/ddsketch/DDSketches.java

This extension makes use of the `collapsingLowestDense` strategy to preserve error guarantees at the highest quantiles.
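
To make the relative-error guarantee concrete, here is a minimal toy sketch of the log-bucket idea from the DDSketch paper. It is not the `sketches-java` API: the class name `RelativeErrorSketchDemo` and its methods are hypothetical, it handles only positive values, and it keeps an unbounded number of bins (no collapsing strategy). It assumes the paper's mapping, where a relative accuracy α gives γ = (1 + α) / (1 − α) and a value x lands in bucket ⌈log_γ(x)⌉.

```java
import java.util.Map;
import java.util.TreeMap;

/** Toy log-bucketed quantile sketch; illustrative only, not the sketches-java API. */
final class RelativeErrorSketchDemo {
    private final double gamma;
    private final double logGamma;
    // bucket index -> number of values that fell into that bucket
    private final TreeMap<Integer, Long> bins = new TreeMap<>();
    private long count = 0;

    RelativeErrorSketchDemo(double relativeAccuracy) {
        this.gamma = (1 + relativeAccuracy) / (1 - relativeAccuracy);
        this.logGamma = Math.log(gamma);
    }

    /** Adds one positive value; bucket i covers (gamma^(i-1), gamma^i]. */
    void accept(double value) {
        if (!(value > 0)) {
            throw new IllegalArgumentException("this toy sketch only handles positive values");
        }
        int index = (int) Math.ceil(Math.log(value) / logGamma);
        bins.merge(index, 1L, Long::sum);
        count++;
    }

    /** Returns the representative value of the bucket holding rank floor(q * (n - 1)). */
    double valueAtQuantile(double q) {
        if (count == 0) {
            throw new IllegalStateException("sketch is empty");
        }
        long rank = (long) Math.floor(q * (count - 1));
        long seen = 0;
        for (Map.Entry<Integer, Long> e : bins.entrySet()) {
            seen += e.getValue();
            if (seen > rank) {
                // This point is within a factor (1 +/- alpha) of every value in the bucket.
                return 2 * Math.pow(gamma, e.getKey()) / (gamma + 1);
            }
        }
        throw new AssertionError("unreachable: ranks are always below count");
    }

    public static void main(String[] args) {
        RelativeErrorSketchDemo sketch = new RelativeErrorSketchDemo(0.01);
        // Made-up long-tail latencies: 1, 4, 9, ..., 1_000_000 (units are arbitrary).
        for (int i = 1; i <= 1000; i++) {
            sketch.accept((double) i * i);
        }
        double exactP99 = 990.0 * 990.0; // sorted value at rank floor(0.99 * 999) = 989
        double estimate = sketch.valueAtQuantile(0.99);
        System.out.println("relative error at p99: " + Math.abs(estimate - exactP99) / exactP99);
    }
}
```

Because the bucket boundaries grow geometrically, the estimate stays within α of the true value whether the quantile falls at a few milliseconds or at several minutes. That is the property that matters for p9x on long-tail data. The sketch state is only a map of counts, so it contains no randomized sampling step.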
