[GitHub] quenlang edited a comment on issue #6853: horrible histogram post-aggregation

GitBox Wed, 16 Jan 2019 01:40:02 -0800

quenlang edited a comment on issue #6853: horrible histogram post-aggregation 
URL: 
https://github.com/apache/incubator-druid/issues/6853#issuecomment-454710643
 
 
   @jon-wei Ingested data stays in memory or on the local disk of the 
middleManager peon task until it is handed off to deep storage, so I ran a 
query that is served by the peon task. The query looks like this:
   ```
   {
     "queryType": "timeseries",
     "dataSource": {
       "type": "table",
       "name": "sketch_1"
     },
     "intervals": {
       "type": "intervals",
       "intervals": [
         "2019-01-16T00:30:00/2019-01-17T00:00:00"
       ]
     },
     "descending": false,
     "virtualColumns": [],
     "granularity": {
       "type": "all"
     },
     "aggregations": [
       {
         "type": "longSum",
         "name": "count_total",
         "fieldName": "count",
         "expression": null
       },
       {
         "type": "approxHistogramFold",
         "name": "dns_time_histogram",
         "fieldName": "dns_time_histogram",
         "resolution": 50,
         "numBuckets": 7,
         "lowerLimit": 0
       },
       {
         "type": "quantilesDoublesSketch",
         "name": "dns_time_sketch",
         "fieldName": "dns_time_sketch",
         "k": 128
       }
     ],
     "postAggregations": [
       {
         "type": "customBuckets",
         "name": "performance_histogram",
         "fieldName": "dns_time_histogram",
         "breaks": [0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, 
"Infinity"]
       },
       {
         "type" : "quantiles",
         "name" : "histogram_quantile",
         "fieldName" : "dns_time_histogram",
         "probabilities" : [0.50, 0.75, 0.90, 0.95]
       },
       {
         "type"  : "quantilesDoublesSketchToHistogram",
         "name": "performance_sketch",
         "field": {
           "type": "fieldAccess",
           "fieldName": "dns_time_sketch"
         },
         "splitPoints" : [200, 400, 600, 800, 1000, 1200, 1400]
       },
       {
         "type"  : "quantilesDoublesSketchToQuantiles",
         "name": "sketch_quantile",
         "field": {
           "type": "fieldAccess",
           "fieldName": "dns_time_sketch"
         },
         "fractions" : [0.50, 0.75, 0.90, 0.95]
       }
     ],
     "context": {
       "skipEmptyBuckets": "true"
     }
   }
   ``` 
   To compare the accuracy of the approximate histogram and the quantiles 
sketch, I defined both aggregations in the same query. The result is below:
   ```
   [ {
     "timestamp" : "2019-01-16T02:01:00.000Z",
     "result" : {
       "performance_histogram" : {
         "breaks" : [ 0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, 
"Infinity" ],
         "counts" : [ 5.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0 ]
       },
       "histogram_quantile" : {
         "probabilities" : [ 0.5, 0.75, 0.9, 0.95 ],
         "quantiles" : [ 99.0, 1150.0, 1772.0, 1886.0 ],
         "min" : 0.0,
         "max" : 2000.0
       },
       "count_total" : 10,
       "dns_time_histogram" : {
         "breaks" : [ -333.3333435058594, 0.0, 333.3333435058594, 
666.6666870117188, 1000.0, 1333.3333740234375, 1666.666748046875, 2000.0 ],
         "counts" : [ 1.0, 5.0, 1.0, 0.0, 0.0, 0.0, 3.0 ]
       },
       "sketch_quantile" : [ 100.0, 1700.0, 2000.0, 2000.0 ],
       "dns_time_sketch" : 10,
       "performance_sketch" : [ 6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0 ]
     }
   } ]
   ```
   The original row data that I wrote to Kafka consists of 11 rows, as below:
   ```
   {"timestamp":1547604098000,"uid":1,"name":"zjk","dns_time":0.0,"count":1}
   {"timestamp":1547604098000,"uid":1,"name":"zjk","dns_time":99.0,"count":1}
   {"timestamp":1547604158000,"uid":2,"name":"quen","dns_time":100.0,"count":1}
   {"timestamp":1547604158000,"uid":2,"name":"quen","dns_time":600.0,"count":1}
   {"timestamp":1547604158000,"uid":3,"name":"quen","dns_time":2000.0,"count":1}
   {"timestamp":1547604218000,"uid":4,"name":"zjk","dns_time":5.0,"count":1}
   {"timestamp":1547604218000,"uid":5,"name":"zjk","dns_time":20.0,"count":1}
   {"timestamp":1547604218000,"uid":6,"name":"zjk","dns_time":2.0,"count":1}
   {"timestamp":1547604218000,"uid":7,"name":"zjk","dns_time":1772.0,"count":1}
   {"timestamp":1547604218000,"uid":8,"name":"zjk","dns_time":1700.0,"count":1}
   {"timestamp":1547604218000,"uid":9,"name":"zjk","dns_time":300.0,"count":1}
   ```
   I also calculated by hand the quantiles for [0.50, 0.75, 0.90, 0.95] and the 
histogram for the breaks [ 0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 
1400.0, "Infinity" ]. They were [100, 1150, 1772, 1886] and [ 6.0, 0.0, 0.0, 
1.0, 0.0, 0.0, 0.0, 3.0 ].  
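
A minimal Python sketch of the manual histogram check, for anyone who wants to 
reproduce it. It rests on two assumptions that are not stated above: buckets 
are treated as left-inclusive, `[break_i, break_i+1)`, and only the ten 
`dns_time` values that the query counted are used (`count_total` is 10), i.e. 
the rows above without the 300.0 row. The names `BREAKS`, `DNS_TIMES` and 
`bucket_counts` are made up for this illustration.

```python
from bisect import bisect_right

# Bucket edges from the customBuckets post-aggregation; the last bucket is open-ended.
BREAKS = [0.0, 200.0, 400.0, 600.0, 800.0, 1000.0, 1200.0, 1400.0, float("inf")]

# Assumption: the ten dns_time values the query counted (the 300.0 row is left out).
DNS_TIMES = [0.0, 99.0, 100.0, 600.0, 2000.0, 5.0, 20.0, 2.0, 1772.0, 1700.0]


def bucket_counts(values, breaks):
    """Count values per bucket [breaks[i], breaks[i+1]); values outside all buckets are skipped."""
    counts = [0] * (len(breaks) - 1)
    for v in values:
        i = bisect_right(breaks, v) - 1  # index of the last edge that is <= v
        if 0 <= i < len(counts):
            counts[i] += 1
    return counts


manual_counts = bucket_counts(DNS_TIMES, BREAKS)
print(manual_counts)  # [6, 0, 0, 1, 0, 0, 0, 3]
```

With the 300.0 row included, the same function puts one extra value into the 
[200, 400) bucket.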
   
   Comparing the actual results with the query results, I found that the 
quantile query on the approximate histogram was more accurate than the one on 
the quantiles sketch, but for the histogram query the quantiles sketch was 
more accurate. 
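
To make the comparison concrete, here is a small Python sketch that computes 
the per-entry absolute error of each method against the hand-calculated 
values. All numbers are copied from the query result and the manual 
calculation in this comment; the variable names and the `abs_errors` helper 
are only for illustration.

```python
# Hand-calculated reference values from this comment.
actual_quantiles = [100.0, 1150.0, 1772.0, 1886.0]
actual_counts = [6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0]

# Values returned by the query.
histogram_quantiles = [99.0, 1150.0, 1772.0, 1886.0]          # "histogram_quantile"
sketch_quantiles = [100.0, 1700.0, 2000.0, 2000.0]            # "sketch_quantile"
histogram_counts = [5.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0, 3.0]   # "performance_histogram"
sketch_counts = [6.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 3.0]      # "performance_sketch"


def abs_errors(estimate, actual):
    """Element-wise absolute difference between an estimate and the reference."""
    return [abs(e - a) for e, a in zip(estimate, actual)]


print(abs_errors(histogram_quantiles, actual_quantiles))  # [1.0, 0.0, 0.0, 0.0]
print(abs_errors(sketch_quantiles, actual_quantiles))     # [0.0, 550.0, 228.0, 114.0]
print(abs_errors(histogram_counts, actual_counts))        # [1.0, 0.0, 1.0, 1.0, 0.0, 0.0, 0.0, 0.0]
print(abs_errors(sketch_counts, actual_counts))           # [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
```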
   
   Can you tell me more about why the quantile query on the approximate 
histogram was more accurate than the one on the quantiles sketch?


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

