Re: [PR] Add SpectatorHistogram extension (druid)

via GitHub Tue, 09 Jan 2024 10:01:31 -0800


vtlim commented on code in PR #15340:
URL: https://github.com/apache/druid/pull/15340#discussion_r1446427421



##########
docs/development/extensions-contrib/spectator-histogram.md:
##########
@@ -0,0 +1,453 @@
+---
+id: spectator-histogram
+title: "Spectator Histogram module"
+---
+
+<!--
+  ~ Licensed to the Apache Software Foundation (ASF) under one
+  ~ or more contributor license agreements.  See the NOTICE file
+  ~ distributed with this work for additional information
+  ~ regarding copyright ownership.  The ASF licenses this file
+  ~ to you under the Apache License, Version 2.0 (the
+  ~ "License"); you may not use this file except in compliance
+  ~ with the License.  You may obtain a copy of the License at
+  ~
+  ~   http://www.apache.org/licenses/LICENSE-2.0
+  ~
+  ~ Unless required by applicable law or agreed to in writing,
+  ~ software distributed under the License is distributed on an
+  ~ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  ~ KIND, either express or implied.  See the License for the
+  ~ specific language governing permissions and limitations
+  ~ under the License.
+  -->
+
+## Summary
+This module provides Apache Druid approximate histogram aggregators and percentile
+post-aggregators based on Spectator fixed-bucket histograms.
+
+Consider SpectatorHistogram to compute percentile approximations. This extension has a reduced storage footprint compared to the [DataSketches extension](../extensions-core/datasketches-extension.md), which results in smaller segment sizes, faster loading from deep storage, and lower memory usage. This extension provides fast and accurate queries on large datasets at low storage cost.
+
+This aggregator only applies when your raw data contains positive long integer values. Do not use this aggregator if you have negative values in your data.
+
+In the Druid instance shown below, the example Wikipedia dataset is loaded 3 times.
+* `wikipedia` contains the dataset ingested as is, without rollup
+* `wikipedia_spectator` contains the dataset with a single extra metric column of type `spectatorHistogram` for the `added` column
+* `wikipedia_datasketch` contains the dataset with a single extra metric column of type `quantilesDoublesSketch` for the `added` column
+
+Spectator histograms average just 6 extra bytes per row, while the `quantilesDoublesSketch`
+adds 48 bytes per row. This represents an eightfold reduction in additional storage size for spectator histograms.
+
+![Comparison of datasource sizes in web console](../../assets/spectator-histogram-size-comparison.png)
+
+As rollup improves, so does the size savings. For example, when you ingest the Wikipedia dataset
+with day-grain query granularity and remove all dimensions except `countryName`,
+this results in a segment that has just 106 rows. The base segment has 87 bytes per row.
+Compare the following bytes per row for SpectatorHistogram versus DataSketches:
+* An additional `spectatorHistogram` column adds 27 bytes per row on average.
+* An additional `quantilesDoublesSketch` column adds 255 bytes per row.
+
+SpectatorHistogram reduces the additional storage size by 9.4 times in this example.
+Storage gains will differ per dataset depending on the variance and rollup of the data.
+
+## Background
+[Spectator](https://netflix.github.io/atlas-docs/spectator/) is a simple library
+for instrumenting code to record dimensional time series data.
+It was built, primarily, to work with [Atlas](https://netflix.github.io/atlas-docs/).
+Atlas was developed by Netflix to manage dimensional time series data for near
+real-time operational insight.
+
+With the [Atlas-Druid](https://github.com/Netflix-Skunkworks/iep-apps/tree/main/atlas-druid)
+service, it's possible to use the power of Atlas queries, backed by Druid as a
+data store to benefit from high-dimensionality and high-cardinality data.
+
+SpectatorHistogram is designed for efficient parallel aggregations while still
+allowing for filtering and grouping by dimensions. 
+It provides similar functionality to the built-in DataSketches `quantilesDoublesSketch` aggregator, but is
+opinionated and optimized for typical measurements from cloud services and web apps.
+For example, measurements such as page load time, transferred bytes, response time, and request latency.

Review Comment:
   ```suggestion
   It provides similar functionality to the built-in DataSketches `quantilesDoublesSketch` aggregator, but is
   opinionated to maintain higher accuracy at smaller values.
   See [Bucket boundaries](#histogram-bucket-boundaries) for more information and an example.
   The SpectatorHistogram is optimized for typical measurements from cloud services and web apps,
   such as page load time, transferred bytes, response time, and request latency.
   
   ```


