This is an automated email from the ASF dual-hosted git repository.
suneet pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git
The following commit(s) were added to refs/heads/master by this push:
new 973e5bf Docs - HLL lgK tip and slight layout change (#11482)
973e5bf is described below
commit 973e5bf7d06c6cb021c242360b2035deb571541d
Author: Peter Marshall <[email protected]>
AuthorDate: Mon Jul 26 20:28:53 2021 +0100
Docs - HLL lgK tip and slight layout change (#11482)
* HLL lgK and a tip
Knowledge transfer from
https://the-asf.slack.com/archives/CJ8D1JTB8/p1600699967024200. Attempted to
make a connection between the SQL HLL function and the HLL underneath without
getting too complicated. Also added a note about using K over 16 being pretty
much pointless.
* Corrected spelling
* Create datasketches-hll.md
Put roll-up back to rollup
* Update docs/development/extensions-core/datasketches-hll.md
Co-authored-by: Abhishek Agarwal
<[email protected]>
---
.../extensions-core/datasketches-hll.md | 42 +++++++++++++++++-----
docs/querying/sql.md | 4 +--
2 files changed, 35 insertions(+), 11 deletions(-)
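Note on the `lgK` tip added in this patch: the diminishing returns above `16` can be illustrated with a rough calculation. The sketch below is Python and is not part of the commit. It assumes the classic HyperLogLog standard-error approximation of 1.04 / sqrt(K), with K = 2^lgK; the DataSketches implementation has its own error bounds, so treat the numbers as indicative only. The function name `approx_rse` is hypothetical.

```python
import math


def approx_rse(lg_k: int) -> float:
    """Approximate relative standard error for a given lgK.

    Assumption: uses the classic HyperLogLog estimate 1.04 / sqrt(K),
    where K = 2 ** lg_k is the number of buckets. This is not taken
    from the DataSketches source and is only a rough guide.
    """
    if not isinstance(lg_k, int) or not 4 <= lg_k <= 21:
        raise ValueError("lgK must be an integer from 4 to 21")
    return 1.04 / math.sqrt(2 ** lg_k)


# Compare the documented default (12) with the ends of the allowed range
# and with the value of 16 mentioned in the tip.
for lg_k in (4, 12, 16, 21):
    buckets = 2 ** lg_k
    print(f"lgK={lg_k:2d}  K={buckets:>7d}  approx RSE={approx_rse(lg_k):.4%}")
```

Under this assumption, the default `lgK` of 12 gives a relative error of roughly 1.6% and `lgK` of 16 gives roughly 0.4%. Each further increment doubles the bucket count while shrinking the error by only a factor of about 1.4.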
diff --git a/docs/development/extensions-core/datasketches-hll.md
b/docs/development/extensions-core/datasketches-hll.md
index cc39e7e..c359dc7 100644
--- a/docs/development/extensions-core/datasketches-hll.md
+++ b/docs/development/extensions-core/datasketches-hll.md
@@ -34,6 +34,20 @@ druid.extensions.loadList=["druid-datasketches"]
### Aggregators
+|property|description|required?|
+|--------|-----------|---------|
+|`type`|This String should be [`HLLSketchBuild`](#hllsketchbuild-aggregator)
or [`HLLSketchMerge`](#hllsketchmerge-aggregator)|yes|
+|`name`|A String for the output (result) name of the calculation.|yes|
+|`fieldName`|A String for the name of the input field.|yes|
+|`lgK`|log2 of K, where K is the number of buckets in the sketch. This parameter
controls the size and the accuracy of the sketch. Must be an integer from 4 to 21,
inclusive.|no, defaults to `12`|
+|`tgtHllType`|The type of the target HLL sketch. Must be `HLL_4`, `HLL_6` or
`HLL_8` |no, defaults to `HLL_4`|
+|`round`|Round off values to whole numbers. Only affects query-time behavior
and is ignored at ingestion-time.|no, defaults to `false`|
+
+
+> The default `lgK` value has proven to be sufficient for most use cases;
expect only negligible improvements in accuracy with `lgK` values over
`16` in normal circumstances.
+
+#### HLLSketchBuild Aggregator
+
```
{
"type" : "HLLSketchBuild",
@@ -45,6 +59,25 @@ druid.extensions.loadList=["druid-datasketches"]
}
```
+> It is very common to use `HLLSketchBuild` in combination with
[rollup](../../ingestion/index.html#rollup) to create a
[metric](../../ingestion/index.html#metricsspec) on high-cardinality columns.
In this example, a metric called `userid_hll` is included in the `metricsSpec`.
This will build an HLL sketch on the `userid` field at ingestion time,
allowing for highly performant approximate `COUNT DISTINCT` query operations
and improving rollup ratios when `userid` is then left out of [...]
+>
+> ```
+> :
+> "metricsSpec": [
+> {
+> "type" : "HLLSketchBuild",
+> "name" : "userid_hll",
+> "fieldName" : "userid",
+> "lgK" : 12,
+> "tgtHllType" : "HLL_4"
+> }
+> ]
+> :
+> ```
+>
+
+#### HLLSketchMerge Aggregator
+
```
{
"type" : "HLLSketchMerge",
@@ -56,15 +89,6 @@ druid.extensions.loadList=["druid-datasketches"]
}
```
-|property|description|required?|
-|--------|-----------|---------|
-|type|This String should be "HLLSketchBuild" or "HLLSketchMerge"|yes|
-|name|A String for the output (result) name of the calculation.|yes|
-|fieldName|A String for the name of the input field.|yes|
-|lgK|log2 of K that is the number of buckets in the sketch, parameter that
controls the size and the accuracy. Must be a power of 2 from 4 to 21
inclusively.|no, defaults to 12|
-|tgtHllType|The type of the target HLL sketch. Must be "HLL_4",
"HLL_6" or "HLL_8" |no, defaults to "HLL_4"|
-|round|Round off values to whole numbers. Only affects query-time behavior and
is ignored at ingestion-time.|no, defaults to false|
-
### Post Aggregators
#### Estimate
diff --git a/docs/querying/sql.md b/docs/querying/sql.md
index fd5903d..00e801d 100644
--- a/docs/querying/sql.md
+++ b/docs/querying/sql.md
@@ -334,8 +334,8 @@ Only the COUNT and ARRAY_AGG aggregations can accept the
DISTINCT keyword.
|`MAX(expr)`|Takes the maximum of numbers.|`null` if
`druid.generic.useDefaultValueForNull=false`, otherwise `-9223372036854775808`
(minimum LONG value)|
|`AVG(expr)`|Averages numbers.|`null` if
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a
regular column or a hyperUnique column. This is always approximate, regardless
of the value of "useApproximateCountDistinct". This uses Druid's built-in
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT
expr)`.|`0`|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct
values of expr, which can be a regular column or an [HLL
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK`
and `tgtHllType` parameters are described in the HLL sketch documentation. This
is always approximate, regardless of the value of
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) mus [...]
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of
expr, which can be a regular column or a [Theta
sketch](../development/extensions-core/datasketches-theta.md) column. The
`size` parameter is described in the Theta sketch documentation. This is always
approximate, regardless of the value of "useApproximateCountDistinct". See also
`COUNT(DISTINCT expr)`. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use thi [...]
+|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct
values of `expr`, which can be a regular column or an [HLL
sketch](../development/extensions-core/datasketches-hll.md) column. Results are
always approximate, regardless of the value of
[`useApproximateCountDistinct`](../querying/sql.html#connection-context). The
`lgK` and `tgtHllType` parameters here are, like the equivalents in the
[aggregator](../development/extensions-core/datasketches-hll.html#aggregators),
des [...]
+|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of
expr, which can be a regular column or a [Theta
sketch](../development/extensions-core/datasketches-theta.md) column. This is
always approximate, regardless of the value of
[`useApproximateCountDistinct`](../querying/sql.html#connection-context). The
`size` parameter is described in the Theta sketch documentation. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded [...]
|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL
sketch](../development/extensions-core/datasketches-hll.md) on the values of
expr, which can be a regular column or a column containing HLL sketches. The
`lgK` and `tgtHllType` parameters are described in the HLL sketch
documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`'0'` (STRING)|
|`DS_THETA(expr, [size])`|Creates a [Theta
sketch](../development/extensions-core/datasketches-theta.md) on the values of
expr, which can be a regular column or a column containing Theta sketches. The
`size` parameter is described in the Theta sketch documentation. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`'0.0'` (STRING)|
|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate
quantiles on numeric or
[approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator)
exprs. The "probability" should be between 0 and 1 (exclusive). The
"resolution" is the number of centroids to use for the computation. Higher
resolutions will give more precise results but also have higher overhead. If
not provided, the default resolution is 50. The [approximate histo [...]