Re: [PR] Add SpectatorHistogram extension (druid)

via GitHub Thu, 04 Jan 2024 12:20:33 -0800


suneet-s commented on code in PR #15340:
URL: https://github.com/apache/druid/pull/15340#discussion_r1440657269



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,386 @@
+---
+id: spectator-histogram
+title: "Spectator Histogram module"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Summary
+This module provides Apache Druid approximate histogram aggregators and percentile
+post-aggregators based on Spectator fixed-bucket histograms.
+
+Consider using this extension if you need percentile approximations and:
+* want fast and accurate queries
+* at a lower storage cost
+* and have a large dataset
+* using only positive measurements
+
+> The main benefit of this extension over data-sketches is the reduced storage
+footprint. Which leads to smaller segment sizes, faster loading from deep storage
+and lower memory usage.
+
+In the Druid instance shown below, the example Wikipedia dataset is loaded 3 times.
+* As-is, no rollup applied
+* With a single extra metric column of type `spectatorHistogram` ingesting the `added` column
+* With a single extra metric column of type `quantilesDoublesSketch` ingesting the `added` column
+
+Spectator histograms average just 6 extra bytes per row, while the data-sketch
+adds 48 bytes per row. This is an 8 x reduction in additional storage size.
+![Comparison of datasource sizes in web console](../../assets/spectator-histogram-size-comparison.png)
+
+As rollup improves, so does the size saving. For example, ingesting the wikipedia data
+with day-grain query granularity and removing all dimensions except `countryName`,
+we get to a segment that has just 106 rows. The base segment is 87 bytes per row,
+adding a single `spectatorHistogram` column adds just 27 bytes per row on average vs
+`quantilesDoublesSketch` adding 255 bytes per row. This is a 9.4 x reduction in additional storage size.
+Storage gains will differ per dataset depending on the variance and rollup of the data.
+
+## Background
+[Spectator](https://netflix.github.io/atlas-docs/spectator/) is a simple library
+for instrumenting code to record dimensional time series data.
+It was built, primarily, to work with [Atlas](https://netflix.github.io/atlas-docs/).
+Atlas was developed by Netflix to manage dimensional time series data for near
+real-time operational insight.
+
+With the [Atlas-Druid](https://github.com/Netflix-Skunkworks/iep-apps/tree/main/atlas-druid)
+service, it's possible to use the power of Atlas queries, backed by Druid as a
+data store to benefit from high-dimensionality and high-cardinality data.
+
+SpectatorHistogram is designed for efficient parallel aggregations while still
+allowing for filtering and grouping by dimensions. 
+It provides similar functionality to the built-in data-sketch aggregator, but is
+opinionated and optimized for typical measurements of cloud services and web-apps.
+Measurements such as page load time, transferred bytes, response time, request latency, etc.
+Through some trade-offs we're able to provide a significantly more compact
+representation with the same aggregation performance and accuracy as
+data-sketches (depending on data-set, see limitations below).
+
+## Limitations

Review Comment:
   Can you add a limitation here noting that this cannot yet be used via SQL?



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,386 @@
+## Limitations
+* Supports positive numeric values within the range of [0, 2^53). Negatives are
+coerced to 0.
+* Fixed buckets with increasing bucket widths. Relative accuracy is maintained,
+but absolute accuracy reduces with larger values.
+
+> If either of these limitations are a problem, then the data-sketch aggregator
+is most likely a better choice.
+
+## Functionality
+The SpectatorHistogram aggregator is capable of generating histograms from raw numeric
+values as well as aggregating/combining pre-aggregated histograms generated using
+the SpectatorHistogram aggregator itself.
+While you can generate histograms on the fly at query time, it is generally more
+performant to generate histograms during ingestion and then combine them at
+query time. This is especially true where rollup is enabled. It may be misleading or
+incorrect to generate histogram from already rolled-up summed data.
+
+The module provides postAggregators, `percentileSpectatorHistogram` (singular) and
+`percentilesSpectatorHistogram` (plural), that can be used to compute approximate
+percentiles from histograms generated by the SpectatorHistogram aggregator.
+Again, these postAggregators can be used to compute percentiles from raw numeric
+values via the SpectatorHistogram aggregator or from pre-aggregated histograms.
+
+> If you're only using the aggregator to compute percentiles from raw numeric values,
+then you can use the built-in data-sketch aggregator instead. The performance
+and accuracy are comparable, the data-sketch aggregator supports negative values,
+and you don't need to load an additional extension.
+ 
+An aggregated SpectatorHistogram can also be queried using a `longSum` or `doubleSum`
+aggregator to retrieve the population of the histogram. This is effectively the count
+of the number of values that were aggregated into the histogram. This flexibility can
+avoid the need to maintain a separate metric for the count of values.
+
+For high-frequency measurements, you may need to pre-aggregate data at the client prior
+to sending into Druid. For example, if you're measuring individual image render times
+on an image-heavy website, you may want to aggregate the render times for a page-view
+into a single histogram prior to sending to Druid in real-time. This can reduce the
+amount of data that's needed to send from the client across the wire.
+
+SpectatorHistogram supports ingesting pre-aggregated histograms in real-time and batch.
+They can be sent as a JSON map, keyed by the spectator bucket ID and the value is the
+count of values. This is the same format as the serialized JSON representation of the
+histogram. The keys need not be ordered or contiguous e.g.
+
+```json
+{ "4":  8, "5": 15, "6": 37, "7": 9, "8": 3, "10": 1, "13": 1 }
+```
+
+## Loading the extension
+To use SpectatorHistogram, make sure you [include](../../configuration/extensions.md#loading-extensions) the extension in your config file:
+
+```
+druid.extensions.loadList=["druid-spectator-histogram"]
+```
+
+## Aggregators
+
+The result of the aggregation is a histogram that is built by ingesting numeric values from
+the raw data, or from combining pre-aggregated histograms. The result is represented in
+JSON format where the keys are the bucket index and the values are the count of entries
+in that bucket.
+
+The buckets are defined as per the Spectator [PercentileBuckets](https://github.com/Netflix/spectator/blob/main/spectator-api/src/main/java/com/netflix/spectator/api/histogram/PercentileBuckets.java) specification.
+See [Appendix](#histogram-bucket-boundaries) for the full list of bucket boundaries.
+```js
+  // The set of buckets is generated by using powers of 4 and incrementing by one-third of the
+  // previous power of 4 in between as long as the value is less than the next power of 4 minus
+  // the delta.
+  //
+  // Base: 1, 2, 3
+  //
+  // 4 (4^1), delta = 1 (~1/3 of 4)
+  //     5, 6, 7, ..., 14,
+  //
+  // 16 (4^2), delta = 5 (~1/3 of 16)
+  //    21, 26, 31, ..., 56,
+  //
+  // 64 (4^3), delta = 21 (~1/3 of 64)
+  // ...
+```
+
+There are multiple aggregator types included, all of which are based on the same
+underlying implementation. The different types signal to the Atlas-Druid service (if using)
+how to handle the resulting data from a query.
+
+* spectatorHistogramTimer signals that the histogram is representing
+a collection of timer values. It is recommended to normalize timer values to nanoseconds
+at, or prior to, ingestion. If queried via the Atlas-Druid service, it will
+normalize timers to second resolution at query time as a more natural unit of time
+for human consumption.
+* spectatorHistogram and spectatorHistogramDistribution are generic histograms that
+can be used to represent any measured value without units. No normalization is
+required or performed.
+
+### `spectatorHistogram` aggregator
+Alias: `spectatorHistogramDistribution`, `spectatorHistogramTimer`
+
+To aggregate at query time:
+```
+{
+  "type" : "spectatorHistogram",
+  "name" : <output_name>,
+  "fieldName" : <column_name>
+ }
+```
+
+| Property  | Description                                                                                                  | Required? |
+|-----------|--------------------------------------------------------------------------------------------------------------|-----------|
+| type      | This String must be one of "spectatorHistogram", "spectatorHistogramTimer", "spectatorHistogramDistribution" | yes       |
+| name      | A String for the output (result) name of the aggregation.                                                    | yes       |
+| fieldName | A String for the name of the input field containing raw numeric values or pre-aggregated histograms.         | yes       |
+
+### `longSum`, `doubleSum` and `floatSum` aggregators
+To get the population size (count of events contributing to the histogram):
+```
+{
+  "type" : "longSum",
+  "name" : <output_name>,
+  "fieldName" : <column_name_of_aggregated_histogram>
+ }
+```
+
+| Property  | Description                                                                    | Required? |
+|-----------|--------------------------------------------------------------------------------|-----------|
+| type      | Must be "longSum", "doubleSum", or "floatSum".                                 | yes       |
+| name      | A String for the output (result) name of the aggregation.                      | yes       |
+| fieldName | A String for the name of the input field containing pre-aggregated histograms. | yes       |
+
+## Post Aggregators
+
+### Percentile (singular)
+This returns a single percentile calculation based on the distribution of the values in the aggregated histogram.
+
+```
+{
+  "type": "percentileSpectatorHistogram",
+  "name": <output name>,
+  "field": {
+    "type": "fieldAccess",
+    "fieldName": <name of aggregated SpectatorHistogram>
+  },
+  "percentile": <decimal percentile, e.g. 50.0 for median>
+}
+```
+
+| Property   | Description                                                 | Required? |
+|------------|-------------------------------------------------------------|-----------|
+| type       | This String should always be "percentileSpectatorHistogram" | yes       |
+| name       | A String for the output (result) name of the calculation.   | yes       |
+| field      | A field reference pointing to the aggregated histogram.     | yes       |
+| percentile | A single decimal percentile between 0.0 and 100.0           | yes       |
+
+### Percentiles (multiple)
+This returns an array of percentiles corresponding to those requested.
+
+```
+{
+  "type": "percentilesSpectatorHistogram",
+  "name": <output name>,
+  "field": {
+    "type": "fieldAccess",
+    "fieldName": <name of aggregated SpectatorHistogram>
+  },
+  "percentiles": [25, 50, 75, 99.5]
+}
+```
+
+> Note: It's more efficient to request multiple percentiles in a single query
+than to request individual percentiles in separate queries. This array-based
+helper is provided for convenience and has a marginal performance benefit over
+using the singular percentile post-aggregator multiple times within a query.
+The more expensive part of the query is the aggregation of the histogram.
+The post-aggregation calculations all happen on the same aggregated histogram.
+
+Results will contain arrays matching the length and order of the requested
+array of percentiles.
+
+```
+"percentilesAdded": [
+    0.5504911679884643, // 25th percentile
+    4.013975155279504,  // 50th percentile 
+    78.89518317503394,  // 75th percentile
+    8580.024999999994   // 99.5th percentile
+]
+```
+
+| Property    | Description                                                  | Required? |
+|-------------|--------------------------------------------------------------|-----------|
+| type        | This String should always be "percentilesSpectatorHistogram" | yes       |
+| name        | A String for the output (result) name of the calculation.    | yes       |
+| field       | A field reference pointing to the aggregated histogram.      | yes       |
+| percentiles | Non-empty array of decimal percentiles between 0.0 and 100.0 | yes       |
+
+## Appendix
+

Review Comment:
   An example spec showing how to ingest the Wikipedia dataset with the spectator histogram would be helpful here.
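
   As a rough sketch of what such an example could cover (not a tested spec): the fragment below shows only the `metricsSpec` and `granularitySpec` portions of a `dataSchema`, assuming the day-grain rollup and the `added` input column described in the Summary. The output name `histogram_added` is a placeholder chosen for illustration, and the rest of the ingestion spec (input source, timestamp, dimensions) is left out.

   ```json
   {
     "metricsSpec": [
       { "type": "count", "name": "count" },
       { "type": "spectatorHistogram", "name": "histogram_added", "fieldName": "added" }
     ],
     "granularitySpec": {
       "segmentGranularity": "day",
       "queryGranularity": "day",
       "rollup": true
     }
   }
   ```

   The `spectatorHistogram` entry uses the same `type`, `name`, and `fieldName` properties documented in the Aggregators section, so the histogram is built at ingestion and only combined at query time.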



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,386 @@
+## Limitations
+* Supports positive numeric values within the range of [0, 2^53). Negatives are

Review Comment:
   I think it would be good to call out that decimals are not supported. When I
first read "numeric values", I assumed that decimals were supported, but the
Druid Summit talk mentions that they are not.



##########
docs/configuration/extensions.md:
##########
@@ -76,30 +76,31 @@ If you'd like to take on maintenance for a community extension, please post on [
 
 All of these community extensions can be downloaded using [pull-deps](../operations/pull-deps.md) while specifying a `-c` coordinate option to pull `org.apache.druid.extensions.contrib:{EXTENSION_NAME}:{DRUID_VERSION}`.
 
-|Name|Description|Docs|
-|----|-----------|----|
-|aliyun-oss-extensions|Aliyun OSS deep storage |[link](../development/extensions-contrib/aliyun-oss-extensions.md)|
-|ambari-metrics-emitter|Ambari Metrics Emitter |[link](../development/extensions-contrib/ambari-metrics-emitter.md)|
-|druid-cassandra-storage|Apache Cassandra deep storage.|[link](../development/extensions-contrib/cassandra.md)|
-|druid-cloudfiles-extensions|Rackspace Cloudfiles deep storage and firehose.|[link](../development/extensions-contrib/cloudfiles.md)|
-|druid-compressed-bigdecimal|Compressed Big Decimal Type | [link](../development/extensions-contrib/compressed-big-decimal.md)|
-|druid-distinctcount|DistinctCount aggregator|[link](../development/extensions-contrib/distinctcount.md)|
-|druid-redis-cache|A cache implementation for Druid based on Redis.|[link](../development/extensions-contrib/redis-cache.md)|
-|druid-time-min-max|Min/Max aggregator for timestamp.|[link](../development/extensions-contrib/time-min-max.md)|
-|sqlserver-metadata-storage|Microsoft SQLServer deep storage.|[link](../development/extensions-contrib/sqlserver.md)|
-|graphite-emitter|Graphite metrics emitter|[link](../development/extensions-contrib/graphite.md)|
-|statsd-emitter|StatsD metrics emitter|[link](../development/extensions-contrib/statsd.md)|
-|kafka-emitter|Kafka metrics emitter|[link](../development/extensions-contrib/kafka-emitter.md)|
-|druid-thrift-extensions|Support thrift ingestion |[link](../development/extensions-contrib/thrift.md)|
-|druid-opentsdb-emitter|OpenTSDB metrics emitter |[link](../development/extensions-contrib/opentsdb-emitter.md)|
-|materialized-view-selection, materialized-view-maintenance|Materialized View|[link](../development/extensions-contrib/materialized-view.md)|
-|druid-moving-average-query|Support for [Moving Average](https://en.wikipedia.org/wiki/Moving_average) and other Aggregate [Window Functions](https://en.wikibooks.org/wiki/Structured_Query_Language/Window_functions) in Druid queries.|[link](../development/extensions-contrib/moving-average-query.md)|
-|druid-influxdb-emitter|InfluxDB metrics emitter|[link](../development/extensions-contrib/influxdb-emitter.md)|
-|druid-momentsketch|Support for approximate quantile queries using the [momentsketch](https://github.com/stanford-futuredata/momentsketch) library|[link](../development/extensions-contrib/momentsketch-quantiles.md)|
-|druid-tdigestsketch|Support for approximate sketch aggregators based on [T-Digest](https://github.com/tdunning/t-digest)|[link](../development/extensions-contrib/tdigestsketch-quantiles.md)|
-|gce-extensions|GCE Extensions|[link](../development/extensions-contrib/gce-extensions.md)|
-|prometheus-emitter|Exposes [Druid metrics](../operations/metrics.md) for Prometheus server collection (https://prometheus.io/)|[link](../development/extensions-contrib/prometheus.md)|
-|kubernetes-overlord-extensions|Support for launching tasks in k8s without Middle Managers|[link](../development/extensions-contrib/k8s-jobs.md)|
+| Name                                                       | Description     
                                                                                
                                                                                
                              | Docs                                            
                     |
+|------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|----------------------------------------------------------------------|
+| aliyun-oss-extensions                                      | Aliyun OSS deep 
storage                                                                         
                                                                                
                              | 
[link](../development/extensions-contrib/aliyun-oss-extensions.md)   |
+| ambari-metrics-emitter                                     | Ambari Metrics 
Emitter                                                                         
                                                                                
                               | 
[link](../development/extensions-contrib/ambari-metrics-emitter.md)  |
+| druid-cassandra-storage                                    | Apache 
Cassandra deep storage.                                                         
                                                                                
                                       | 
[link](../development/extensions-contrib/cassandra.md)               |
+| druid-cloudfiles-extensions                                | Rackspace 
Cloudfiles deep storage and firehose.                                           
                                                                                
                                    | 
[link](../development/extensions-contrib/cloudfiles.md)              |
+| druid-compressed-bigdecimal                                | Compressed Big 
Decimal Type                                                                    
                                                                                
                               | 
[link](../development/extensions-contrib/compressed-big-decimal.md)  |
+| druid-distinctcount                                        | DistinctCount 
aggregator                                                                      
                                                                                
                                | 
[link](../development/extensions-contrib/distinctcount.md)           |
+| druid-redis-cache                                          | A cache 
implementation for Druid based on Redis.                                        
                                                                                
                                      | 
[link](../development/extensions-contrib/redis-cache.md)             |
+| druid-time-min-max                                         | Min/Max 
aggregator for timestamp.                                                       
                                                                                
                                      | 
[link](../development/extensions-contrib/time-min-max.md)            |
+| sqlserver-metadata-storage                                 | Microsoft 
SQLServer deep storage.                                                         
                                                                                
                                    | 
[link](../development/extensions-contrib/sqlserver.md)               |
+| graphite-emitter                                           | Graphite 
metrics emitter                                                                 
                                                                                
                                     | 
[link](../development/extensions-contrib/graphite.md)                |
+| statsd-emitter                                             | StatsD metrics 
emitter                                                                         
                                                                                
                               | 
[link](../development/extensions-contrib/statsd.md)                  |
+| kafka-emitter                                              | Kafka metrics 
emitter                                                                         
                                                                                
                                | 
[link](../development/extensions-contrib/kafka-emitter.md)           |
+| druid-thrift-extensions                                    | Support thrift 
ingestion                                                                       
                                                                                
                               | 
[link](../development/extensions-contrib/thrift.md)                  |
+| druid-opentsdb-emitter                                     | OpenTSDB 
metrics emitter                                                                 
                                                                                
                                     | 
[link](../development/extensions-contrib/opentsdb-emitter.md)        |
+| materialized-view-selection, materialized-view-maintenance | Materialized 
View                                                                            
                                                                                
                                 | 
[link](../development/extensions-contrib/materialized-view.md)       |
+| druid-moving-average-query                                 | Support for 
[Moving Average](https://en.wikipedia.org/wiki/Moving_average) and other 
Aggregate [Window 
Functions](https://en.wikibooks.org/wiki/Structured_Query_Language/Window_functions)
 in Druid queries. | 
[link](../development/extensions-contrib/moving-average-query.md)    |
+| druid-influxdb-emitter                                     | InfluxDB 
metrics emitter                                                                 
                                                                                
                                     | 
[link](../development/extensions-contrib/influxdb-emitter.md)        |
+| druid-momentsketch                                         | Support for 
approximate quantile queries using the 
[momentsketch](https://github.com/stanford-futuredata/momentsketch) library     
                                                                           | 
[link](../development/extensions-contrib/momentsketch-quantiles.md)  |
+| druid-tdigestsketch                                        | Support for 
approximate sketch aggregators based on 
[T-Digest](https://github.com/tdunning/t-digest)                                
                                                                          | 
[link](../development/extensions-contrib/tdigestsketch-quantiles.md) |
+| gce-extensions                                             | GCE Extensions  
                                                                                
                                                                                
                              | 
[link](../development/extensions-contrib/gce-extensions.md)          |
+| prometheus-emitter                                         | Exposes [Druid 
metrics](../operations/metrics.md) for Prometheus server collection 
(https://prometheus.io/)                                                        
                                           | 
[link](../development/extensions-contrib/prometheus.md)              |
+| kubernetes-overlord-extensions                             | Support for 
launching tasks in k8s without Middle Managers                                  
                                                                                
                                  | 
[link](../development/extensions-contrib/k8s-jobs.md)                |
+| druid-spectator-histogram                                  | Support for 
efficient approximate percentile queries                                        
                                                                                
                                  | 
[link](../development/extensions-contrib/spectator-histogram.md)     |

Review Comment:
   Could you update your editor settings and undo the formatting changes to this table, please?
   
   https://github.com/apache/druid/blob/master/dev/druid_intellij_formatting.xml#L77-L80
   This setting was recently added to the druid_intellij_formatting.xml file, so if you re-import it, the formatter should no longer reformat the tables when you edit them.



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,386 @@
+---
+id: spectator-histogram
+title: "Spectator Histogram module"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Summary
+This module provides Apache Druid approximate histogram aggregators and 
percentile
+post-aggregators based on Spectator fixed-bucket histograms.
+
+Consider using this extension if you need percentile approximations and:
+* want fast and accurate queries
+* at a lower storage cost
+* and have a large dataset
+* using only positive measurements
+
+> The main benefit of this extension over data-sketches is the reduced storage
+footprint. Which leads to smaller segment sizes, faster loading from deep 
storage
+and lower memory usage.
+
+In the Druid instance shown below, the example Wikipedia dataset is loaded 3 
times.
+* As-is, no rollup applied
+* With a single extra metric column of type `spectatorHistogram` ingesting the 
`added` column
+* With a single extra metric column of type `quantilesDoublesSketch` ingesting 
the `added` column
+
+Spectator histograms average just 6 extra bytes per row, while the data-sketch
+adds 48 bytes per row. This is an 8 x reduction in additional storage size.
+![Comparison of datasource sizes in web 
console](../../assets/spectator-histogram-size-comparison.png)
+
+As rollup improves, so does the size saving. For example, ingesting the 
wikipedia data
+with day-grain query granularity and removing all dimensions except 
`countryName`,
+we get to a segment that has just 106 rows. The base segment is 87 bytes per 
row,
+adding a single `spectatorHistogram` column adds just 27 bytes per row on 
average vs
+`quantilesDoublesSketch` adding 255 bytes per row. This is a 9.4 x reduction 
in additional storage size.
+Storage gains will differ per dataset depending on the variance and rollup of 
the data.
+
+## Background
+[Spectator](https://netflix.github.io/atlas-docs/spectator/) is a simple 
library
+for instrumenting code to record dimensional time series data.
+It was built, primarily, to work with 
[Atlas](https://netflix.github.io/atlas-docs/).
+Atlas was developed by Netflix to manage dimensional time series data for near
+real-time operational insight.
+
+With the 
[Atlas-Druid](https://github.com/Netflix-Skunkworks/iep-apps/tree/main/atlas-druid)
+service, it's possible to use the power of Atlas queries, backed by Druid as a
+data store to benefit from high-dimensionality and high-cardinality data.
+
+SpectatorHistogram is designed for efficient parallel aggregations while still
+allowing for filtering and grouping by dimensions. 
+It provides similar functionality to the built-in data-sketch aggregator, but 
is
+opinionated and optimized for typical measurements of cloud services and 
web-apps.
+Measurements such as page load time, transferred bytes, response time, request 
latency, etc.
+Through some trade-offs we're able to provide a significantly more compact
+representation with the same aggregation performance and accuracy as
+data-sketches (depending on data-set, see limitations below).
+
+## Limitations
+* Supports positive numeric values within the range of [0, 2^53). Negatives are
+coerced to 0.
+* Fixed buckets with increasing bucket widths. Relative accuracy is maintained,
+but absolute accuracy reduces with larger values.

Review Comment:
   Can you explain the accuracy tradeoff here versus other sketch implementations?
   
   I don't understand what "absolute accuracy reduces with larger values" means. Maybe an example in the docs would help clear it up.
   
   I think that sort of information will help users decide which sketch implementation to use for their use case.
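
For illustration, here is one way such an example could look. It is a minimal Python sketch based only on the bucket description quoted in the docs (powers of 4, stepping by roughly one-third of each power); it is not the Spectator implementation. The names `bucket_boundaries`, `surrounding_boundaries`, and `max_exp` are hypothetical, and the lookup assumes a value is only known to lie somewhere between two adjacent boundaries.

```python
from bisect import bisect_right


def bucket_boundaries(max_exp=14):
    """Boundaries per the quoted comment: powers of 4, stepping by ~1/3 of each power.

    max_exp is a hypothetical cap (largest power of 2 to expand) that keeps the list short.
    """
    bounds = [1, 2, 3]
    for exp in range(2, max_exp + 1, 2):   # 4^1, 4^2, 4^3, ...
        current = 1 << exp
        delta = current // 3               # roughly one-third of the current power of 4
        limit = (current << 2) - delta     # next power of 4 minus the delta
        while current < limit:
            bounds.append(current)
            current += delta
    return bounds


def surrounding_boundaries(value, bounds):
    """Return adjacent boundaries (lower, upper) with lower <= value < upper.

    Only meaningful for 1 <= value < bounds[-1] in this sketch.
    """
    i = bisect_right(bounds, value)
    return bounds[i - 1], bounds[i]


bounds = bucket_boundaries()
for value in (20, 20000):
    lower, upper = surrounding_boundaries(value, bounds)
    print(value, lower, upper, upper - lower)
```

Under these assumptions, a value of 20 falls between boundaries 16 and 21 (a gap of 5), while 20000 falls between 16384 and 21845 (a gap of 5461). The gap is roughly a quarter to a third of the value in both cases, which is presumably what "relative accuracy is maintained, but absolute accuracy reduces with larger values" refers to.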



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,386 @@
+---
+id: spectator-histogram
+title: "Spectator Histogram module"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Summary
+This module provides Apache Druid approximate histogram aggregators and 
percentile
+post-aggregators based on Spectator fixed-bucket histograms.
+
+Consider using this extension if you need percentile approximations and:
+* want fast and accurate queries
+* at a lower storage cost
+* and have a large dataset
+* using only positive measurements
+
+> The main benefit of this extension over data-sketches is the reduced storage
+footprint. Which leads to smaller segment sizes, faster loading from deep 
storage
+and lower memory usage.
+
+In the Druid instance shown below, the example Wikipedia dataset is loaded 3 
times.
+* As-is, no rollup applied
+* With a single extra metric column of type `spectatorHistogram` ingesting the 
`added` column
+* With a single extra metric column of type `quantilesDoublesSketch` ingesting 
the `added` column
+
+Spectator histograms average just 6 extra bytes per row, while the data-sketch
+adds 48 bytes per row. This is an 8 x reduction in additional storage size.
+![Comparison of datasource sizes in web 
console](../../assets/spectator-histogram-size-comparison.png)
+
+As rollup improves, so does the size saving. For example, ingesting the 
wikipedia data
+with day-grain query granularity and removing all dimensions except 
`countryName`,
+we get to a segment that has just 106 rows. The base segment is 87 bytes per 
row,
+adding a single `spectatorHistogram` column adds just 27 bytes per row on 
average vs
+`quantilesDoublesSketch` adding 255 bytes per row. This is a 9.4 x reduction 
in additional storage size.
+Storage gains will differ per dataset depending on the variance and rollup of 
the data.
+
+## Background
+[Spectator](https://netflix.github.io/atlas-docs/spectator/) is a simple 
library
+for instrumenting code to record dimensional time series data.
+It was built, primarily, to work with 
[Atlas](https://netflix.github.io/atlas-docs/).
+Atlas was developed by Netflix to manage dimensional time series data for near
+real-time operational insight.
+
+With the 
[Atlas-Druid](https://github.com/Netflix-Skunkworks/iep-apps/tree/main/atlas-druid)
+service, it's possible to use the power of Atlas queries, backed by Druid as a
+data store to benefit from high-dimensionality and high-cardinality data.
+
+SpectatorHistogram is designed for efficient parallel aggregations while still
+allowing for filtering and grouping by dimensions. 
+It provides similar functionality to the built-in data-sketch aggregator, but 
is
+opinionated and optimized for typical measurements of cloud services and 
web-apps.
+Measurements such as page load time, transferred bytes, response time, request 
latency, etc.
+Through some trade-offs we're able to provide a significantly more compact
+representation with the same aggregation performance and accuracy as
+data-sketches (depending on data-set, see limitations below).
+
+## Limitations
+* Supports positive numeric values within the range of [0, 2^53). Negatives are
+coerced to 0.
+* Fixed buckets with increasing bucket widths. Relative accuracy is maintained,
+but absolute accuracy reduces with larger values.
+
+> If either of these limitations are a problem, then the data-sketch aggregator
+is most likely a better choice.
+
+## Functionality
+The SpectatorHistogram aggregator is capable of generating histograms from raw 
numeric
+values as well as aggregating/combining pre-aggregated histograms generated 
using
+the SpectatorHistogram aggregator itself.
+While you can generate histograms on the fly at query time, it is generally 
more
+performant to generate histograms during ingestion and then combine them at
+query time. This is especially true where rollup is enabled. It may be 
misleading or 
+incorrect to generate histogram from already rolled-up summed data.
+
+The module provides postAggregators, `percentileSpectatorHistogram` (singular) 
and
+`percentilesSpectatorHistogram` (plural), that can be used to compute 
approximate 
+percentiles from histograms generated by the SpectatorHistogram aggregator.
+Again, these postAggregators can be used to compute percentiles from raw 
numeric
+values via the SpectatorHistogram aggregator or from pre-aggregated histograms.
+
+> If you're only using the aggregator to compute percentiles from raw numeric 
values,
+then you can use the built-in data-sketch aggregator instead. The performance
+and accuracy are comparable, the data-sketch aggregator supports negative 
values,
+and you don't need to load an additional extension.
+ 
+An aggregated SpectatorHistogram can also be queried using a `longSum` or 
`doubleSum`
+aggregator to retrieve the population of the histogram. This is effectively 
the count
+of the number of values that were aggregated into the histogram. This 
flexibility can
+avoid the need to maintain a separate metric for the count of values.
+
+For high-frequency measurements, you may need to pre-aggregate data at the 
client prior
+to sending into Druid. For example, if you're measuring individual image 
render times
+on an image-heavy website, you may want to aggregate the render times for a 
page-view
+into a single histogram prior to sending to Druid in real-time. This can 
reduce the
+amount of data that's needed to send from the client across the wire.
+
+SpectatorHistogram supports ingesting pre-aggregated histograms in real-time 
and batch.
+They can be sent as a JSON map, keyed by the spectator bucket ID and the value 
is the
+count of values. This is the same format as the serialized JSON representation 
of the
+histogram. The keys need not be ordered or contiguous e.g.
+
+```json
+{ "4":  8, "5": 15, "6": 37, "7": 9, "8": 3, "10": 1, "13": 1 }
+```
+
+## Loading the extension
+To use SpectatorHistogram, make sure you 
[include](../../configuration/extensions.md#loading-extensions) the extension 
in your config file:
+
+```
+druid.extensions.loadList=["druid-spectator-histogram"]
+```
+
+## Aggregators
+
+The result of the aggregation is a histogram that is built by ingesting 
numeric values from
+the raw data, or from combining pre-aggregated histograms. The result is 
represented in 
+JSON format where the keys are the bucket index and the values are the count 
of entries
+in that bucket.
+
+The buckets are defined as per the Spectator 
[PercentileBuckets](https://github.com/Netflix/spectator/blob/main/spectator-api/src/main/java/com/netflix/spectator/api/histogram/PercentileBuckets.java)
 specification.
+See [Appendix](#histogram-bucket-boundaries) for the full list of bucket 
boundaries.
+```js
+  // The set of buckets is generated by using powers of 4 and incrementing by 
one-third of the
+  // previous power of 4 in between as long as the value is less than the next 
power of 4 minus
+  // the delta.
+  //
+  // Base: 1, 2, 3
+  //
+  // 4 (4^1), delta = 1 (~1/3 of 4)
+  //     5, 6, 7, ..., 14,
+  //
+  // 16 (4^2), delta = 5 (~1/3 of 16)
+  //    21, 26, 31, ..., 56,
+  //
+  // 64 (4^3), delta = 21 (~1/3 of 64)
+  // ...
+```
+
+There are multiple aggregator types included, all of which are based on the 
same
+underlying implementation. The different types signal to the Atlas-Druid 
service (if using)
+how to handle the resulting data from a query.
+
+* spectatorHistogramTimer signals that the histogram is representing
+a collection of timer values. It is recommended to normalize timer values to 
nanoseconds
+at, or prior to, ingestion. If queried via the Atlas-Druid service, it will
+normalize timers to second resolution at query time as a more natural unit of 
time
+for human consumption.
+* spectatorHistogram and spectatorHistogramDistribution are generic histograms 
that
+can be used to represent any measured value without units. No normalization is
+required or performed.
+
+### `spectatorHistogram` aggregator
+Alias: `spectatorHistogramDistribution`, `spectatorHistogramTimer`
+
+To aggregate at query time:
+```
+{
+  "type" : "spectatorHistogram",
+  "name" : <output_name>,
+  "fieldName" : <column_name>
+ }
+```
+
+| Property  | Description                                                      
                                            | Required? |
+|-----------|--------------------------------------------------------------------------------------------------------------|-----------|
+| type      | This String must be one of "spectatorHistogram", 
"spectatorHistogramTimer", "spectatorHistogramDistribution" | yes       |
+| name      | A String for the output (result) name of the aggregation.        
                                            | yes       |
+| fieldName | A String for the name of the input field containing raw numeric 
values or pre-aggregated histograms.         | yes       |
+
+### `longSum`, `doubleSum` and `floatSum` aggregators
+To get the population size (count of events contributing to the histogram):
+```
+{
+  "type" : "longSum",
+  "name" : <output_name>,
+  "fieldName" : <column_name_of_aggregated_histogram>
+ }
+```
+
+| Property  | Description                                                      
              | Required? |
+|-----------|--------------------------------------------------------------------------------|-----------|
+| type      | Must be "longSum", "doubleSum", or "floatSum".                   
              | yes       |
+| name      | A String for the output (result) name of the aggregation.        
              | yes       |
+| fieldName | A String for the name of the input field containing 
pre-aggregated histograms. | yes       |
+
+## Post Aggregators
+
+### Percentile (singular)
+This returns a single percentile calculation based on the distribution of the 
values in the aggregated histogram.
+
+```
+{
+  "type": "percentileSpectatorHistogram",
+  "name": <output name>,
+  "field": {
+    "type": "fieldAccess",
+    "fieldName": <name of aggregated SpectatorHistogram>
+  },
+  "percentile": <decimal percentile, e.g. 50.0 for median>
+}
+```
+
+| Property   | Description                                                 | 
Required? |
+|------------|-------------------------------------------------------------|-----------|
+| type       | This String should always be "percentileSpectatorHistogram" | 
yes       |
+| name       | A String for the output (result) name of the calculation.   | 
yes       |
+| field      | A field reference pointing to the aggregated histogram.     | 
yes       |
+| percentile | A single decimal percentile between 0.0 and 100.0           | 
yes       |
+
+### Percentiles (multiple)
+This returns an array of percentiles corresponding to those requested.
+
+```
+{
+  "type": "percentilesSpectatorHistogram",
+  "name": <output name>,
+  "field": {
+    "type": "fieldAccess",
+    "fieldName": <name of aggregated SpectatorHistogram>
+  },
+  "percentiles": [25, 50, 75, 99.5]
+}
+```
+
+> Note: It's more efficient to request multiple percentiles in a single query

Review Comment:
   nit: Given this note, would it be a nicer UX if the extension did not provide a way to get a single percentile? Users who want a single percentile could pass in an array with one element.
   
   I don't have a strong opinion on this, so if you think having both functions is better, that's fine with me too.
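
To make the suggestion concrete, the sketch below builds a `percentilesSpectatorHistogram` post-aggregator that requests a single percentile through a one-element array. The JSON field names are taken from the quoted docs; the helper `percentiles_post_agg` and the names `p50` and `hist` are hypothetical, and whether the extension accepts a one-element array is an assumption rather than something the docs confirm.

```python
import json


def percentiles_post_agg(name, histogram_field, percentiles):
    """Build a percentilesSpectatorHistogram post-aggregator spec as a dict."""
    return {
        "type": "percentilesSpectatorHistogram",
        "name": name,
        "field": {"type": "fieldAccess", "fieldName": histogram_field},
        "percentiles": list(percentiles),
    }


# Hypothetical names: "p50" for the output, "hist" for the aggregated histogram.
spec = percentiles_post_agg("p50", "hist", [50])
print(json.dumps(spec, indent=2))
```

With this shape, the result would presumably come back as a one-element array rather than a scalar, which is the main behavioral difference from the singular post-aggregator.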



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

