quenlang edited a comment on issue #6853: horrible histogram post-aggregation
URL: https://github.com/apache/incubator-druid/issues/6853#issuecomment-454710643

@jon-wei Data ingested into Druid stays in memory or on the local disk of the middleManager peon task until it is handed off to deep storage, so I ran a query that would be served by the peon task. The query looks like this:
```
{
  "queryType": "timeseries",
  "dataSource": { "type": "table", "name": "sketch_1" },
  "intervals": {
    "type": "intervals",
    "intervals": [ "2019-01-16T00:30:00/2019-01-17T00:00:00" ]
  },
  "descending": false,
  "virtualColumns": [],
  "granularity": { "type": "all" },
  "aggregations": [
    { "type": "longSum", "name": "count_total", "fieldName": "count", "expression": null },
    {
      "type": "approxHistogramFold",
      "name": "dns_time_histogram",
      "fieldName": "dns_time_histogram",
      "resolution": 50,
      "numBuckets": 7,
      "lowerLimit": 0
    },
    {
      "type": "quantilesDoublesSketch",
      "name": "dns_time_sketch",
      "fieldName": "dns_time_sketch",
      "k": 128
    }
  ],
  "postAggregations": [
    {
      "type": "customBuckets",
      "name": "performance_histogram",
      "fieldName": "dns_time_histogram",
      "breaks": [0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, "Infinity"]
    },
    {
      "type": "quantiles",
      "name": "histogram_quantile",
      "fieldName": "dns_time_histogram",
      "probabilities": [0.50, 0.75, 0.90, 0.95]
    },
    {
      "type": "quantilesDoublesSketchToHistogram",
      "name": "performance_sketch",
      "field": { "type": "fieldAccess", "fieldName": "dns_time_sketch" },
      "splitPoints": [200, 400, 600, 800, 1000, 1200, 1400]
    },
    {
      "type": "quantilesDoublesSketchToQuantiles",
      "name": "sketch_quantile",
      "field": { "type": "fieldAccess", "fieldName": "dns_time_sketch" },
      "fractions": [0.50, 0.75, 0.90, 0.95]
    }
  ],
  "context": { "skipEmptyBuckets": "true" }
}
```
To compare the accuracy of the approximate histogram and the quantiles sketch, I defined both aggregations in the same query.
The result is as follows:
```
[ {
  "timestamp" : "2019-01-16T02:01:00.000Z",
  "result" : {
    "performance_histogram" : {
      "breaks" : [ 0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, "Infinity" ],
      "counts" : [ 5.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0 ]
    },
    "histogram_quantile" : {
      "probabilities" : [ 0.5, 0.75, 0.9, 0.95 ],
      "quantiles" : [ 99.0, 1150.0, 1772.0, 1886.0 ],
      "min" : 0.0,
      "max" : 2000.0
    },
    "count_total" : 10,
    "dns_time_histogram" : {
      "breaks" : [ -333.3333435058594, 0.0, 333.3333435058594, 666.6666870117188, 1000.0, 1333.3333740234375, 1666.666748046875, 2000.0 ],
      "counts" : [ 1.0, 5.0, 1.0, 0.0, 0.0, 0.0, 3.0 ]
    },
    "sketch_quantile" : [ 100.0, 1700.0, 2000.0, 2000.0 ],
    "dns_time_sketch" : 10,
    "performance_sketch" : [ 6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0 ]
  }
} ]
```
The original row data that I wrote to Kafka, 11 rows, is as follows:
```
{"timestamp":1547604098000,"uid":1,"name":"zjk","dns_time":0.0,"count":1}
{"timestamp":1547604098000,"uid":1,"name":"zjk","dns_time":99.0,"count":1}
{"timestamp":1547604158000,"uid":2,"name":"quen","dns_time":100.0,"count":1}
{"timestamp":1547604158000,"uid":2,"name":"quen","dns_time":600.0,"count":1}
{"timestamp":1547604158000,"uid":3,"name":"quen","dns_time":2000.0,"count":1}
{"timestamp":1547604218000,"uid":4,"name":"zjk","dns_time":5.0,"count":1}
{"timestamp":1547604218000,"uid":5,"name":"zjk","dns_time":20.0,"count":1}
{"timestamp":1547604218000,"uid":6,"name":"zjk","dns_time":2.0,"count":1}
{"timestamp":1547604218000,"uid":7,"name":"zjk","dns_time":1772.0,"count":1}
{"timestamp":1547604218000,"uid":8,"name":"zjk","dns_time":1700.0,"count":1}
{"timestamp":1547604218000,"uid":9,"name":"zjk","dns_time":,300.0"count":1n}
```
I also calculated by hand the quantiles for [0.50, 0.75, 0.90, 0.95] and the histogram for the breaks [ 0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, "Infinity" ]. They were [100, 1150, 1772, 1886] and [ 6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0 ], respectively.
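The manual check can be reproduced with a minimal Python sketch. It assumes that only the 10 well-formed rows are counted (the last row is malformed as posted, and `count_total` is 10) and that each bucket is a half-open range `[lower, upper)`. The names `histogram_counts` and `pick_quantile` are hypothetical helpers, not Druid APIs, and `pick_quantile` uses just one of several possible quantile conventions (the observed value at sorted index `floor(p * n)`, no interpolation); it is not confirmed to be what either aggregator does.

```python
from bisect import bisect_right
from math import inf

# dns_time values of the 10 well-formed rows.
# Assumption: the malformed last row is not counted (count_total is 10).
DNS_TIMES = [0.0, 99.0, 100.0, 600.0, 2000.0, 5.0, 20.0, 2.0, 1772.0, 1700.0]
BREAKS = [0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, inf]
PROBABILITIES = [0.50, 0.75, 0.90, 0.95]


def histogram_counts(values, breaks):
    """Count values per half-open bucket [breaks[i], breaks[i + 1])."""
    counts = [0] * (len(breaks) - 1)
    for value in values:
        bucket = bisect_right(breaks, value) - 1
        if 0 <= bucket < len(counts):  # values outside the breaks are ignored
            counts[bucket] += 1
    return counts


def pick_quantile(values, p):
    """Return the observed value at sorted index floor(p * n), clamped to n - 1.

    Hypothetical convention for illustration only: no interpolation.
    """
    ordered = sorted(values)
    index = min(int(p * len(ordered)), len(ordered) - 1)
    return ordered[index]


if __name__ == "__main__":
    print(histogram_counts(DNS_TIMES, BREAKS))
    # [6, 0, 0, 1, 0, 0, 0, 3]
    print([pick_quantile(DNS_TIMES, p) for p in PROBABILITIES])
    # [100.0, 1700.0, 2000.0, 2000.0]
```

On these rows the bucket counts match both the hand-calculated histogram and `performance_sketch`, and the non-interpolating pick happens to give the same values as `sketch_quantile`. That may point to interpolation versus picking an observed value as the source of the difference, but this is only a guess from the numbers above.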
Comparing the actual values with the query result, I found that the quantile query on the approximate histogram was more accurate than the one on the quantiles sketch, but for the histogram query the quantiles sketch won. Can you tell me more about why the quantile query on the approximate histogram was more accurate than the one on the quantiles sketch?
