jon-wei commented on a change in pull request #6638: Fixed buckets histogram 
aggregator
URL: https://github.com/apache/incubator-druid/pull/6638#discussion_r244888826
 
 

 ##########
 File path: docs/content/development/extensions-core/approximate-histograms.md
 ##########
 @@ -91,17 +95,138 @@ query.
 |`numBuckets`             |Number of output buckets for the resulting 
histogram. Bucket intervals are dynamic, based on the range of the underlying 
data. Use a post-aggregator to have finer control over the bucketing scheme|7|
 |`lowerLimit`/`upperLimit`|Restrict the approximation to the given range. The 
values outside this range will be aggregated into two centroids. Counts of 
values outside this range are still maintained. |-INF/+INF|
 
+## Fixed Buckets Histogram
+
+The fixed buckets histogram aggregator builds a histogram on a numeric column, 
with evenly-sized buckets across a specified value range. Values outside of the 
range are handled based on a user-specified outlier handling mode.
+
+This histogram supports the min/max/quantiles post-aggregators but does not 
support the bucketing post-aggregators.
+
+
+|Property                 |Description                   |Default                           |
+|-------------------------|------------------------------|----------------------------------|
+|`type`|Type of the aggregator. Must be `fixedBucketsHistogram`.|No default, must 
be specified|
+|`name`|Column name for the aggregator.|No default, must be specified|
+|`fieldName`|Column name of the input to the aggregator.|No default, must be 
specified|
+|`lowerLimit`|Lower limit of the histogram. |No default, must be specified|
+|`upperLimit`|Upper limit of the histogram. |No default, must be specified|
+|`numBuckets`|Number of buckets for the histogram. The range [lowerLimit, 
upperLimit] will be divided into `numBuckets` intervals of equal size.|10|
+|`outlierHandlingMode`|Specifies how values outside of [lowerLimit, 
upperLimit] will be handled. Supported modes are "ignore", "overflow", and 
"clip". See [outlier handling modes](#outlier-handling-modes) for more 
details.|No default, must be specified|
+
+An example aggregator spec is shown below:
+
+```json
+{
+  "type" : "fixedBucketsHistogram",
+  "name" : <output_name>,
+  "fieldName" : <metric_name>,
+  "numBuckets" : <integer>,
+  "lowerLimit" : <double>,
+  "upperLimit" : <double>,
+  "outlierHandlingMode": <mode>
+}
+```
+
+### Outlier handling modes
+
+The outlier handling mode specifies what should be done with values outside of 
the histogram's range. There are three supported modes:
+
+`ignore`: Throw away outlier values.
+`overflow`: A count of outlier values will be tracked by the histogram, 
available in the `lowerOutlierCount` and `upperOutlierCount` fields.
+`clip`: Outlier values will be clipped to the `lowerLimit` or the `upperLimit` 
and included in the histogram.
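
The three modes above can be illustrated with a minimal Python sketch. This is not the extension's implementation: the class name `FixedBucketsSketch` and its attribute names are hypothetical, and two details the text does not state are assumptions here, namely that a value equal to `upperLimit` falls into the last bucket and that clipped values are added to `count`.

```python
from typing import List, Optional


class FixedBucketsSketch:
    """Hypothetical illustration of fixed-width bucketing with outlier modes."""

    def __init__(self, lower_limit: float, upper_limit: float,
                 num_buckets: int, mode: str) -> None:
        if mode not in ("ignore", "overflow", "clip"):
            raise ValueError(f"unsupported outlier handling mode: {mode}")
        if num_buckets <= 0 or upper_limit <= lower_limit:
            raise ValueError("need num_buckets > 0 and upper_limit > lower_limit")
        self.lower_limit = lower_limit
        self.upper_limit = upper_limit
        self.num_buckets = num_buckets
        self.mode = mode
        self.histogram: List[int] = [0] * num_buckets
        self.count = 0
        self.lower_outlier_count = 0
        self.upper_outlier_count = 0
        self.missing_value_count = 0

    def add(self, value: Optional[float]) -> None:
        if value is None:
            self.missing_value_count += 1
            return
        if value < self.lower_limit or value > self.upper_limit:
            if self.mode == "ignore":
                return  # outlier is dropped entirely
            if self.mode == "overflow":
                # outlier is counted separately, not placed in a bucket
                if value < self.lower_limit:
                    self.lower_outlier_count += 1
                else:
                    self.upper_outlier_count += 1
                return
            # "clip": pull the value back to the nearest limit
            value = min(max(value, self.lower_limit), self.upper_limit)
        width = (self.upper_limit - self.lower_limit) / self.num_buckets
        # Assumption: a value equal to upper_limit belongs to the last bucket.
        index = min(int((value - self.lower_limit) / width), self.num_buckets - 1)
        self.histogram[index] += 1
        self.count += 1
```

Feeding the same values through each mode shows the difference: with range [0, 10] and 5 buckets, the value 12 is dropped under `ignore`, increments `upper_outlier_count` under `overflow`, and lands in the last bucket under `clip`.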
+
+### Output fields
+
+The histogram aggregator's output object has the following fields:
+
+`lowerLimit`: Lower limit of the histogram
+`upperLimit`: Upper limit of the histogram
+`numBuckets`: Number of histogram buckets
+`outlierHandlingMode`: Outlier handling mode
+`count`: Total number of values contained in the histogram, excluding outliers
+`lowerOutlierCount`: Count of outlier values below `lowerLimit`. Only used if 
the outlier mode is `overflow`.
+`upperOutlierCount`: Count of outlier values above `upperLimit`. Only used if 
the outlier mode is `overflow`.
+`missingValueCount`: Count of null values seen by the histogram.
+`max`: Max value seen by the histogram. This does not include outlier values.
+`min`: Min value seen by the histogram. This does not include outlier values.
+`histogram`: An array of longs with size `numBuckets`, containing the bucket 
counts
+
+### Serialization formats
 
-### Approximate Histogram post-aggregators
+#### Full serialization format
+
+This format stores the complete array of histogram bucket counts, one entry 
per bucket.
+
+```
+byte: serialization version, must be 0x01
+byte: encoding mode, 0x01 for full
+double: lowerLimit
+double: upperLimit
+int: numBuckets
+byte: outlier handling mode (0x00 for `ignore`, 0x01 for `overflow`, and 0x02 
for `clip`)
+long: count, total number of values contained in the histogram, excluding 
outliers
+long: lowerOutlierCount
+long: upperOutlierCount
+long: missingValueCount
+double: max
+double: min
+array of longs: bucket counts for the histogram
+```
+
+#### Sparse serialization format
+
+This format represents the histogram bucket counts as (bucketNum, count) 
pairs. It is used when fewer than half of the histogram's buckets have 
non-zero counts.
+
+```
+byte: serialization version, must be 0x01
+byte: encoding mode, 0x02 for sparse
+double: lowerLimit
+double: upperLimit
+int: numBuckets
+byte: outlier handling mode (0x00 for `ignore`, 0x01 for `overflow`, and 0x02 
for `clip`)
+long: count, total number of values contained in the histogram, excluding 
outliers
+long: lowerOutlierCount
+long: upperOutlierCount
+long: missingValueCount
+double: max
+double: min
+int: number of following (bucketNum, count) pairs
+sequence of (int, long) pairs:
+  int: bucket number
+  long: bucket count
+```
+
+### Ingesting existing histograms 
+
+It is also possible to ingest existing fixed buckets histograms. The input 
must be a Base64 string encoding a byte array that contains a serialized 
histogram object. Both "full" and "sparse" formats can be used.
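
As a rough sketch of what such an input looks like, the Python snippet below builds a small full-format byte array, Base64-encodes it, and decodes it again. The function name `decode_histogram` is hypothetical, the dictionary keys simply mirror the output field names listed earlier, and big-endian byte order is an assumption that the format description does not confirm.

```python
import base64
import struct

# Assumption: big-endian fields laid out exactly as in the format listings.
HEADER = struct.Struct(">BBddiBqqqqdd")
MODE_NAMES = {0x00: "ignore", 0x01: "overflow", 0x02: "clip"}


def decode_histogram(b64_text: str) -> dict:
    """Decode a Base64 string holding a full or sparse serialized histogram."""
    data = base64.b64decode(b64_text)
    (version, encoding, lower, upper, num_buckets, mode, count,
     lower_out, upper_out, missing, max_v, min_v) = HEADER.unpack_from(data, 0)
    if version != 0x01:
        raise ValueError(f"unsupported serialization version: {version}")
    offset = HEADER.size
    if encoding == 0x01:  # full: one long per bucket
        buckets = list(struct.unpack_from(f">{num_buckets}q", data, offset))
    elif encoding == 0x02:  # sparse: pair count, then (int, long) pairs
        (num_pairs,) = struct.unpack_from(">i", data, offset)
        offset += 4
        buckets = [0] * num_buckets
        for _ in range(num_pairs):
            bucket_num, bucket_count = struct.unpack_from(">iq", data, offset)
            buckets[bucket_num] = bucket_count
            offset += 12
    else:
        raise ValueError(f"unknown encoding mode: {encoding}")
    return {
        "lowerLimit": lower, "upperLimit": upper, "numBuckets": num_buckets,
        "outlierHandlingMode": MODE_NAMES[mode], "count": count,
        "lowerOutlierCount": lower_out, "upperOutlierCount": upper_out,
        "missingValueCount": missing, "max": max_v, "min": min_v,
        "histogram": buckets,
    }


# Build a two-bucket histogram in the full format and encode it for ingestion.
raw = HEADER.pack(0x01, 0x01, 0.0, 10.0, 2, 0x00, 3, 0, 0, 0, 9.0, 1.0)
raw += struct.pack(">2q", 2, 1)
encoded = base64.b64encode(raw).decode("ascii")
decoded = decode_histogram(encoded)
```

The `encoded` string is the kind of value that would appear in the input column; decoding it here is only a sanity check on the layout.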
 
 Review comment:
   Reordered this
