[druid] branch master updated: Docs - HLL lgK tip and slight layout change (#11482)

suneet Mon, 26 Jul 2021 12:29:26 -0700

This is an automated email from the ASF dual-hosted git repository.

suneet pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/druid.git



The following commit(s) were added to refs/heads/master by this push:
     new 973e5bf  Docs - HLL lgK tip and slight layout change (#11482)
973e5bf is described below

commit 973e5bf7d06c6cb021c242360b2035deb571541d
Author: Peter Marshall <[email protected]>
AuthorDate: Mon Jul 26 20:28:53 2021 +0100

    Docs - HLL lgK tip and slight layout change (#11482)
    
    * HLL lgK and a tip
    
    Knowledge transfer from https://the-asf.slack.com/archives/CJ8D1JTB8/p1600699967024200.  Attempted to make a connection between the SQL HLL function and the HLL underneath without getting too complicated.  Also added a note about using K over 16 being pretty much pointless.
    
    * Corrected spelling
    
    * Create datasketches-hll.md
    
    Put roll-up back to rollup
    
    * Update docs/development/extensions-core/datasketches-hll.md
    
    Co-authored-by: Abhishek Agarwal <[email protected]>
    
    Co-authored-by: Abhishek Agarwal <[email protected]>
---
 .../extensions-core/datasketches-hll.md            | 42 +++++++++++++++++-----
 docs/querying/sql.md                               |  4 +--
 2 files changed, 35 insertions(+), 11 deletions(-)
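
For orientation on the `lgK` tip in this commit: `lgK` is the base-2 logarithm of the bucket count, so K = 2^lgK. The sketch below is only a rough illustration and assumes the textbook HyperLogLog approximation, relative standard error of about 1.04 / sqrt(K). It is not the DataSketches library's own error bound, and the helper names `hll_buckets` and `approx_rse` are hypothetical.

```python
import math


def hll_buckets(lg_k: int) -> int:
    """Return K = 2**lgK, the number of buckets in the sketch."""
    # The documented range for lgK is 4 to 21 inclusive.
    if not 4 <= lg_k <= 21:
        raise ValueError("lgK must be an integer from 4 to 21")
    return 1 << lg_k


def approx_rse(lg_k: int) -> float:
    """Textbook HLL estimate of relative standard error (an assumption,
    not the exact bound published for DataSketches)."""
    return 1.04 / math.sqrt(hll_buckets(lg_k))


if __name__ == "__main__":
    # Compare the default (12) with the value named in the tip (16)
    # and the maximum (21).
    for lg_k in (12, 16, 21):
        print(f"lgK={lg_k:2d}  K={hll_buckets(lg_k):>8d}  ~RSE={approx_rse(lg_k):.4%}")
```

Under that approximation, each extra step of `lgK` doubles the bucket count but shrinks the error only by a factor of sqrt(2), which is consistent with the tip that values above `16` bring little practical gain.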

diff --git a/docs/development/extensions-core/datasketches-hll.md b/docs/development/extensions-core/datasketches-hll.md
index cc39e7e..c359dc7 100644
--- a/docs/development/extensions-core/datasketches-hll.md
+++ b/docs/development/extensions-core/datasketches-hll.md
@@ -34,6 +34,20 @@ druid.extensions.loadList=["druid-datasketches"]
 
 ### Aggregators
 
+|property|description|required?|
+|--------|-----------|---------|
+|`type`|This String should be [`HLLSketchBuild`](#hllsketchbuild-aggregator) or [`HLLSketchMerge`](#hllsketchmerge-aggregator)|yes|
+|`name`|A String for the output (result) name of the calculation.|yes|
+|`fieldName`|A String for the name of the input field.|yes|
+|`lgK`|log2 of K that is the number of buckets in the sketch, parameter that controls the size and the accuracy. Must be a power of 2 from 4 to 21 inclusively.|no, defaults to `12`|
+|`tgtHllType`|The type of the target HLL sketch. Must be `HLL_4`, `HLL_6` or `HLL_8` |no, defaults to `HLL_4`|
+|`round`|Round off values to whole numbers. Only affects query-time behavior and is ignored at ingestion-time.|no, defaults to `false`|
+
+
+> The default `lgK` value has proven to be sufficient for most use cases; expect only very negligible improvements in accuracy with `lgK` values over `16` in normal circumstances.
+
+#### HLLSketchBuild Aggregator
+
 ```
 {
   "type" : "HLLSketchBuild",
@@ -45,6 +59,25 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
+> It is very common to use `HLLSketchBuild` in combination with [rollup](../../ingestion/index.html#rollup) to create a [metric](../../ingestion/index.html#metricsspec) on high-cardinality columns.  In this example, a metric called `userid_hll` is included in the `metricsSpec`.  This will perform a HLL sketch on the `userid` field at ingestion time, allowing for highly-performant approximate `COUNT DISTINCT` query operations and improving roll-up ratios when `userid` is then left out of  [...]
+>
+> ```
+> :
+> "metricsSpec": [
+>  {
+>    "type" : "HLLSketchBuild",
+>    "name" : "userid_hll",
+>    "fieldName" : "userid",
+>    "lgK" : 12,
+>    "tgtHllType" : "HLL_4"
+>  }
+> ]
+> :
+> ```
+>
+
+#### HLLSketchMerge Aggregator
+
 ```
 {
   "type" : "HLLSketchMerge",
@@ -56,15 +89,6 @@ druid.extensions.loadList=["druid-datasketches"]
  }
 ```
 
-|property|description|required?|
-|--------|-----------|---------|
-|type|This String should be "HLLSketchBuild" or "HLLSketchMerge"|yes|
-|name|A String for the output (result) name of the calculation.|yes|
-|fieldName|A String for the name of the input field.|yes|
-|lgK|log2 of K that is the number of buckets in the sketch, parameter that controls the size and the accuracy. Must be a power of 2 from 4 to 21 inclusively.|no, defaults to 12|
-|tgtHllType|The type of the target HLL sketch. Must be "HLL&lowbar;4", "HLL&lowbar;6" or "HLL&lowbar;8" |no, defaults to "HLL&lowbar;4"|
-|round|Round off values to whole numbers. Only affects query-time behavior and is ignored at ingestion-time.|no, defaults to false|
-
 ### Post Aggregators
 
 #### Estimate
diff --git a/docs/querying/sql.md b/docs/querying/sql.md
index fd5903d..00e801d 100644
--- a/docs/querying/sql.md
+++ b/docs/querying/sql.md
@@ -334,8 +334,8 @@ Only the COUNT and ARRAY_AGG aggregations can accept the DISTINCT keyword.
 |`MAX(expr)`|Takes the maximum of numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `-9223372036854775808` (minimum LONG value)|
 |`AVG(expr)`|Averages numbers.|`null` if `druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a regular column or a hyperUnique column. This is always approximate, regardless of the value of "useApproximateCountDistinct". This uses Druid's built-in "cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|`0`|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of expr, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) mus [...]
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. The `size` parameter is described in the Theta sketch documentation. This is always approximate, regardless of the value of "useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use thi [...]
+|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct values of `expr`, which can be a regular column or an [HLL sketch](../development/extensions-core/datasketches-hll.md) column. Results are always approximate, regardless of the value of [`useApproximateCountDistinct`](../querying/sql.html#connection-context). The `lgK` and `tgtHllType` parameters here are, like the equivalents in the [aggregator](../development/extensions-core/datasketches-hll.html#aggregators), des [...]
+|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of expr, which can be a regular column or a [Theta sketch](../development/extensions-core/datasketches-theta.md) column. This is always approximate, regardless of the value of [`useApproximateCountDistinct`](../querying/sql.html#connection-context).  The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded [...]
 |`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL sketch](../development/extensions-core/datasketches-hll.md) on the values of expr, which can be a regular column or a column containing HLL sketches. The `lgK` and `tgtHllType` parameters are described in the HLL sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0'` (STRING)|
 |`DS_THETA(expr, [size])`|Creates a [Theta sketch](../development/extensions-core/datasketches-theta.md) on the values of expr, which can be a regular column or a column containing Theta sketches. The `size` parameter is described in the Theta sketch documentation. The [DataSketches extension](../development/extensions-core/datasketches-extension.md) must be loaded to use this function.|`'0.0'` (STRING)|
 |`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate quantiles on numeric or [approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator) exprs. The "probability" should be between 0 and 1 (exclusive). The "resolution" is the number of centroids to use for the computation. Higher resolutions will give more precise results but also have higher overhead. If not provided, the default resolution is 50. The [approximate histo [...]

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
