gianm commented on a change in pull request #11188:
URL: https://github.com/apache/druid/pull/11188#discussion_r627818592
##########
File path: docs/querying/sql.md
##########
@@ -313,48 +313,51 @@ possible for two aggregators in the same SQL query to
have different filters.
Only the COUNT and ARRAY_AGG aggregations can accept DISTINCT.
+When no rows are selected, aggregate functions will return their initialized
value for the grouping they belong to. What this value is exactly for a given
aggregator is dependent on the configuration of Druid's SQL compatible null
handling mode, controlled by `druid.generic.useDefaultValueForNull`. The table
below defines the initial values for all aggregate functions in both modes.
Review comment:
Hmm this paragraph reads oddly to me for two reasons:
- At first blush I think a typical user would think it's impossible for
groups to exist that do not have any rows. (It isn't typical SQLy behavior.) So
we should list some examples of when the default value will show up. I can
think of two cases: grand total (aggregations with no `group by`) and filtered
aggregators where the filter does not match any rows within the group.
- Not all aggregators have behavior dependent on
`druid.generic.useDefaultValueForNull`, so it's not technically correct to say
this categorically. I don't think we need to mention
`druid.generic.useDefaultValueForNull` at all here, actually, because the
individual aggregators in the table below call it out when appropriate. Or, if
we do mention it, we could just say that it "may depend on" rather than "is
dependent on".
I'd welcome opinions from others about how to express this most clearly.
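For what it's worth, the two cases above can be sketched in standard SQL. This uses Python's `sqlite3` purely for illustration (the filtered aggregation is emulated with `CASE` rather than Druid's filtered aggregators, and Druid's actual fallback values additionally depend on the table below):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE t (grp TEXT, x INTEGER)")
con.executemany("INSERT INTO t VALUES (?, ?)", [("a", 1), ("a", 2)])

# Case 1: grand total (no GROUP BY) where the filter matches no rows.
# COUNT falls back to 0, while SUM falls back to NULL.
row = con.execute("SELECT COUNT(*), SUM(x) FROM t WHERE x > 100").fetchone()
print(row)  # (0, None)

# Case 2: a filtered aggregation (emulated with CASE) whose condition
# matches no rows within the group also yields the fallback value.
row2 = con.execute(
    "SELECT grp, SUM(CASE WHEN x > 100 THEN x END) FROM t GROUP BY grp"
).fetchone()
print(row2)  # ('a', None)
```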
##########
File path: docs/querying/sql.md
##########
@@ -313,48 +313,51 @@ possible for two aggregators in the same SQL query to
have different filters.
Only the COUNT and ARRAY_AGG aggregations can accept DISTINCT.
+When no rows are selected, aggregate functions will return their initialized
value for the grouping they belong to. What this value is exactly for a given
aggregator is dependent on the configuration of Druid's SQL compatible null
handling mode, controlled by `druid.generic.useDefaultValueForNull`. The table
below defines the initial values for all aggregate functions in both modes.
+
> The order of aggregation operations across segments is not deterministic.
> This means that non-commutative aggregation
> functions can produce inconsistent results across runs of the same query.
>
> Functions that operate on an input type of "float" or "double" may also see
> these differences in aggregation
> results across multiple query runs because of this. If precisely the same
> value is desired across multiple query runs,
> consider using the `ROUND` function to smooth out the inconsistencies
> between queries.
-|Function|Notes|
-|--------|-----|
-|`COUNT(*)`|Counts the number of rows.|
-|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string,
numeric, or hyperUnique. By default this is approximate, using a variant of
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To
get exact counts set "useApproximateCountDistinct" to "false". If you do this,
expr must be string or numeric, since exact counts are not possible using
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode,
only one distinct count per query is permitted unless
`useGroupingSetForExactDistinct` is set to true in query contexts or broker
configurations.|
-|`SUM(expr)`|Sums numbers.|
-|`MIN(expr)`|Takes the minimum of numbers.|
-|`MAX(expr)`|Takes the maximum of numbers.|
-|`AVG(expr)`|Averages numbers.|
-|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a
regular column or a hyperUnique column. This is always approximate, regardless
of the value of "useApproximateCountDistinct". This uses Druid's built-in
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct
values of expr, which can be a regular column or an [HLL
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK`
and `tgtHllType` parameters are described in the HLL sketch documentation. This
is always approximate, regardless of the value of
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of
expr, which can be a regular column or a [Theta
sketch](../development/extensions-core/datasketches-theta.md) column. The
`size` parameter is described in the Theta sketch documentation. This is always
approximate, regardless of the value of "useApproximateCountDistinct". See also
`COUNT(DISTINCT expr)`. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL
sketch](../development/extensions-core/datasketches-hll.md) on the values of
expr, which can be a regular column or a column containing HLL sketches. The
`lgK` and `tgtHllType` parameters are described in the HLL sketch
documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`DS_THETA(expr, [size])`|Creates a [Theta
sketch](../development/extensions-core/datasketches-theta.md) on the values of
expr, which can be a regular column or a column containing Theta sketches. The
`size` parameter is described in the Theta sketch documentation. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate
quantiles on numeric or
[approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator)
exprs. The "probability" should be between 0 and 1 (exclusive). The
"resolution" is the number of centroids to use for the computation. Higher
resolutions will give more precise results but also have higher overhead. If
not provided, the default resolution is 50. The [approximate histogram
extension](../development/extensions-core/approximate-histograms.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles
on numeric or [Quantiles
sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The
"probability" should be between 0 and 1 (exclusive). The `k` parameter is
described in the Quantiles sketch documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit,
upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric
or [fixed buckets
histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram)
exprs. The "probability" should be between 0 and 1 (exclusive). The
`numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters
are described in the fixed buckets histogram documentation. The [approximate
histogram extension](../development/extensions-core/approximate-histograms.md)
must be loaded to use this function.|
-|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles
sketch](../development/extensions-core/datasketches-quantiles.md) on the values
of expr, which can be a regular column or a column containing quantiles
sketches. The `k` parameter is described in the Quantiles sketch documentation.
The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced
by `expr`, with `numEntries` as the maximum number of distinct values before the
false positive rate increases. See [bloom filter
extension](../development/extensions-core/bloom-filter.md) documentation for
additional details.|
-|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest
sketch on values produced by `expr` and returns the value for the quantile.
Compression parameter (default value 100) determines the accuracy and size of
the sketch. Higher compression means higher accuracy but more space to store
sketches. See [t-digest
extension](../development/extensions-contrib/tdigestsketch-quantiles.md)
documentation for additional details.|
-|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on
values produced by `expr`. Compression parameter (default value 100) determines
the accuracy and size of the sketch. Higher compression means higher accuracy
but more space to store sketches. See [t-digest
extension](../development/extensions-contrib/tdigestsketch-quantiles.md)
documentation for additional details.|
-|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`VARIANCE(expr)`|Computes variance sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See
[stats extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric.
If `expr` comes from a relation with a timestamp column (like a Druid
datasource) then "earliest" is the value first encountered with the minimum
overall timestamp of all values being aggregated. If `expr` does not come from
a relation with a timestamp, then it is simply the first value encountered.|
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings.
The `maxBytesPerString` parameter determines how much aggregation space to
allocate per string. Strings longer than this limit will be truncated. This
parameter should be set as low as possible, since high values will lead to
wasted memory.|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If
`expr` comes from a relation with a timestamp column (like a Druid datasource)
then "latest" is the value last encountered with the maximum overall timestamp
of all values being aggregated. If `expr` does not come from a relation with a
timestamp, then it is simply the last value encountered.|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The
`maxBytesPerString` parameter determines how much aggregation space to allocate
per string. Strings longer than this limit will be truncated. This parameter
should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be
numeric. This aggregator can improve performance by returning the first value
it encounters (including null).|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings.
The `maxBytesPerString` parameter determines how much aggregation space to
allocate per string. Strings longer than this limit will be truncated. This
parameter should be set as low as possible, since high values will lead to
wasted memory.|
-|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy
dimension is included in a row, when using `GROUPING SETS`. Refer to
[additional documentation](aggregations.md#grouping-aggregator) on how to infer
this number.|
-|`ARRAY_AGG(expr, [size])`|Collects all values of `expr` into an ARRAY,
including null values, with a `size`-byte limit on the aggregation size (default
of 1024 bytes). Use of `ORDER BY` within the `ARRAY_AGG` expression is not
currently supported, and the ordering of results within the output array may
vary depending on processing order.|
-|`ARRAY_AGG(DISTINCT expr, [size])`|Collects all distinct values of `expr`
into an ARRAY, including null values, with a `size`-byte limit on the aggregation
size (default of 1024 bytes) per aggregate. Use of `ORDER BY` within the
`ARRAY_AGG` expression is not currently supported, and the ordering of results
within the output array may vary depending on processing order.|
+|Function|Notes|Default|
+|--------|-----|-------|
+|`COUNT(*)`|Counts the number of rows.|`0`|
+|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string,
numeric, or hyperUnique. By default this is approximate, using a variant of
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To
get exact counts set "useApproximateCountDistinct" to "false". If you do this,
expr must be string or numeric, since exact counts are not possible using
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode,
only one distinct count per query is permitted unless
`useGroupingSetForExactDistinct` is set to true in query contexts or broker
configurations.|`0`|
+|`SUM(expr)`|Sums numbers.|`null` if
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`MIN(expr)`|Takes the minimum of numbers.|`null` if
`druid.generic.useDefaultValueForNull=false`, otherwise `9223372036854775807`
(maximum LONG value)|
+|`MAX(expr)`|Takes the maximum of numbers.|`0` in 'default' mode, `null` in
SQL compatible mode|
Review comment:
Missed this one?
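Tangentially, the note in the hunk above about non-commutative aggregation of floats and the `ROUND` workaround can be illustrated with a minimal Python sketch (plain floating-point addition, not Druid itself):

```python
# Floating-point addition is not associative, so the order in which
# segments happen to be aggregated can change the result of the same query.
values = [0.1, 0.2, 0.3]

left_to_right = (values[0] + values[1]) + values[2]   # 0.6000000000000001
right_to_left = values[0] + (values[1] + values[2])   # 0.6

assert left_to_right != right_to_left

# Rounding smooths out the discrepancy, as the docs suggest.
assert round(left_to_right, 9) == round(right_to_left, 9) == 0.6
```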
##########
File path: docs/querying/sql.md
##########
@@ -313,46 +313,48 @@ possible for two aggregators in the same SQL query to
have different filters.
Only the COUNT aggregation can accept DISTINCT.
+When no rows are selected, aggregate functions will return their initialized
value for the grouping they belong to. What this value is exactly for a given
aggregator is dependent on the configuration of Druid's SQL compatible null
handling mode, controlled by `druid.generic.useDefaultValueForNull`. The table
below defines the initial values for all aggregate functions in both modes.
+
> The order of aggregation operations across segments is not deterministic.
> This means that non-commutative aggregation
> functions can produce inconsistent results across runs of the same query.
>
> Functions that operate on an input type of "float" or "double" may also see
> these differences in aggregation
> results across multiple query runs because of this. If precisely the same
> value is desired across multiple query runs,
> consider using the `ROUND` function to smooth out the inconsistencies
> between queries.
-|Function|Notes|
-|--------|-----|
-|`COUNT(*)`|Counts the number of rows.|
-|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string,
numeric, or hyperUnique. By default this is approximate, using a variant of
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To
get exact counts set "useApproximateCountDistinct" to "false". If you do this,
expr must be string or numeric, since exact counts are not possible using
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode,
only one distinct count per query is permitted unless
`useGroupingSetForExactDistinct` is set to true in query contexts or broker
configurations.|
-|`SUM(expr)`|Sums numbers.|
-|`MIN(expr)`|Takes the minimum of numbers.|
-|`MAX(expr)`|Takes the maximum of numbers.|
-|`AVG(expr)`|Averages numbers.|
-|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a
regular column or a hyperUnique column. This is always approximate, regardless
of the value of "useApproximateCountDistinct". This uses Druid's built-in
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT expr)`.|
-|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct
values of expr, which can be a regular column or an [HLL
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK`
and `tgtHllType` parameters are described in the HLL sketch documentation. This
is always approximate, regardless of the value of
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of
expr, which can be a regular column or a [Theta
sketch](../development/extensions-core/datasketches-theta.md) column. The
`size` parameter is described in the Theta sketch documentation. This is always
approximate, regardless of the value of "useApproximateCountDistinct". See also
`COUNT(DISTINCT expr)`. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL
sketch](../development/extensions-core/datasketches-hll.md) on the values of
expr, which can be a regular column or a column containing HLL sketches. The
`lgK` and `tgtHllType` parameters are described in the HLL sketch
documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`DS_THETA(expr, [size])`|Creates a [Theta
sketch](../development/extensions-core/datasketches-theta.md) on the values of
expr, which can be a regular column or a column containing Theta sketches. The
`size` parameter is described in the Theta sketch documentation. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE(expr, probability, [resolution])`|Computes approximate
quantiles on numeric or
[approxHistogram](../development/extensions-core/approximate-histograms.md#approximate-histogram-aggregator)
exprs. The "probability" should be between 0 and 1 (exclusive). The
"resolution" is the number of centroids to use for the computation. Higher
resolutions will give more precise results but also have higher overhead. If
not provided, the default resolution is 50. The [approximate histogram
extension](../development/extensions-core/approximate-histograms.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE_DS(expr, probability, [k])`|Computes approximate quantiles
on numeric or [Quantiles
sketch](../development/extensions-core/datasketches-quantiles.md) exprs. The
"probability" should be between 0 and 1 (exclusive). The `k` parameter is
described in the Quantiles sketch documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`APPROX_QUANTILE_FIXED_BUCKETS(expr, probability, numBuckets, lowerLimit,
upperLimit, [outlierHandlingMode])`|Computes approximate quantiles on numeric
or [fixed buckets
histogram](../development/extensions-core/approximate-histograms.md#fixed-buckets-histogram)
exprs. The "probability" should be between 0 and 1 (exclusive). The
`numBuckets`, `lowerLimit`, `upperLimit`, and `outlierHandlingMode` parameters
are described in the fixed buckets histogram documentation. The [approximate
histogram extension](../development/extensions-core/approximate-histograms.md)
must be loaded to use this function.|
-|`DS_QUANTILES_SKETCH(expr, [k])`|Creates a [Quantiles
sketch](../development/extensions-core/datasketches-quantiles.md) on the values
of expr, which can be a regular column or a column containing quantiles
sketches. The `k` parameter is described in the Quantiles sketch documentation.
The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|
-|`BLOOM_FILTER(expr, numEntries)`|Computes a bloom filter from values produced
by `expr`, with `numEntries` as the maximum number of distinct values before the
false positive rate increases. See [bloom filter
extension](../development/extensions-core/bloom-filter.md) documentation for
additional details.|
-|`TDIGEST_QUANTILE(expr, quantileFraction, [compression])`|Builds a T-Digest
sketch on values produced by `expr` and returns the value for the quantile.
Compression parameter (default value 100) determines the accuracy and size of
the sketch. Higher compression means higher accuracy but more space to store
sketches. See [t-digest
extension](../development/extensions-contrib/tdigestsketch-quantiles.md)
documentation for additional details.|
-|`TDIGEST_GENERATE_SKETCH(expr, [compression])`|Builds a T-Digest sketch on
values produced by `expr`. Compression parameter (default value 100) determines
the accuracy and size of the sketch. Higher compression means higher accuracy
but more space to store sketches. See [t-digest
extension](../development/extensions-contrib/tdigestsketch-quantiles.md)
documentation for additional details.|
-|`VAR_POP(expr)`|Computes variance population of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`VAR_SAMP(expr)`|Computes variance sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`VARIANCE(expr)`|Computes variance sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See
[stats extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats
extension](../development/extensions-core/stats.md) documentation for
additional details.|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric.
If `expr` comes from a relation with a timestamp column (like a Druid
datasource) then "earliest" is the value first encountered with the minimum
overall timestamp of all values being aggregated. If `expr` does not come from
a relation with a timestamp, then it is simply the first value encountered.|
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings.
The `maxBytesPerString` parameter determines how much aggregation space to
allocate per string. Strings longer than this limit will be truncated. This
parameter should be set as low as possible, since high values will lead to
wasted memory.|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. If
`expr` comes from a relation with a timestamp column (like a Druid datasource)
then "latest" is the value last encountered with the maximum overall timestamp
of all values being aggregated. If `expr` does not come from a relation with a
timestamp, then it is simply the last value encountered.|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The
`maxBytesPerString` parameter determines how much aggregation space to allocate
per string. Strings longer than this limit will be truncated. This parameter
should be set as low as possible, since high values will lead to wasted memory.|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be
numeric. This aggregator can improve performance by returning the first value
it encounters (including null).|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings.
The `maxBytesPerString` parameter determines how much aggregation space to
allocate per string. Strings longer than this limit will be truncated. This
parameter should be set as low as possible, since high values will lead to
wasted memory.|
-|`GROUPING(expr, expr...)`|Returns a number to indicate which groupBy
dimension is included in a row, when using `GROUPING SETS`. Refer to
[additional documentation](aggregations.md#grouping-aggregator) on how to infer
this number.|
+|Function|Notes|Default|
+|--------|-----|-------|
+|`COUNT(*)`|Counts the number of rows.|`0`|
+|`COUNT(DISTINCT expr)`|Counts distinct values of expr, which can be string,
numeric, or hyperUnique. By default this is approximate, using a variant of
[HyperLogLog](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). To
get exact counts set "useApproximateCountDistinct" to "false". If you do this,
expr must be string or numeric, since exact counts are not possible using
hyperUnique columns. See also `APPROX_COUNT_DISTINCT(expr)`. In exact mode,
only one distinct count per query is permitted unless
`useGroupingSetForExactDistinct` is set to true in query contexts or broker
configurations.|`0`|
+|`SUM(expr)`|Sums numbers.|`0` in 'default' mode, `null` in SQL compatible
mode|
+|`MIN(expr)`|Takes the minimum of numbers.|`Long.MAX_VALUE` in 'default' mode,
`null` in SQL compatible mode|
+|`MAX(expr)`|Takes the maximum of numbers.|`Long.MIN_VALUE` in 'default' mode,
`null` in SQL compatible mode|
+|`AVG(expr)`|Averages numbers.|`0` in 'default' mode, `null` in SQL compatible
mode|
+|`APPROX_COUNT_DISTINCT(expr)`|Counts distinct values of expr, which can be a
regular column or a hyperUnique column. This is always approximate, regardless
of the value of "useApproximateCountDistinct". This uses Druid's built-in
"cardinality" or "hyperUnique" aggregators. See also `COUNT(DISTINCT
expr)`.|`0`|
+|`APPROX_COUNT_DISTINCT_DS_HLL(expr, [lgK, tgtHllType])`|Counts distinct
values of expr, which can be a regular column or an [HLL
sketch](../development/extensions-core/datasketches-hll.md) column. The `lgK`
and `tgtHllType` parameters are described in the HLL sketch documentation. This
is always approximate, regardless of the value of
"useApproximateCountDistinct". See also `COUNT(DISTINCT expr)`. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`0`|
+|`APPROX_COUNT_DISTINCT_DS_THETA(expr, [size])`|Counts distinct values of
expr, which can be a regular column or a [Theta
sketch](../development/extensions-core/datasketches-theta.md) column. The
`size` parameter is described in the Theta sketch documentation. This is always
approximate, regardless of the value of "useApproximateCountDistinct". See also
`COUNT(DISTINCT expr)`. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`0`|
+|`DS_HLL(expr, [lgK, tgtHllType])`|Creates an [HLL
sketch](../development/extensions-core/datasketches-hll.md) on the values of
expr, which can be a regular column or a column containing HLL sketches. The
`lgK` and `tgtHllType` parameters are described in the HLL sketch
documentation. The [DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`'0'` (STRING)|
+|`DS_THETA(expr, [size])`|Creates a [Theta
sketch](../development/extensions-core/datasketches-theta.md) on the values of
expr, which can be a regular column or a column containing Theta sketches. The
`size` parameter is described in the Theta sketch documentation. The
[DataSketches
extension](../development/extensions-core/datasketches-extension.md) must be
loaded to use this function.|`'0.0'` (STRING)|
Review comment:
> Hmm, it actually returns a double, but we don't examine the finalized
type so calcite thinks it is complex which is I guess how it ends up as a
string instead of double because it just tries to serialize complex values (and
I assume the same is true of the other sketches that return a string result).
Hmm, weird, but, OK. We might want to change it to something saner later,
but we don't have to do that right now.
> I guess this also brings up the question of if we need to describe the
difference between intermediary types and finalized types here
I think the way you did it is right.
I don't think we need to describe the intermediary / finalized type
difference in user-facing documentation. That should be an internal detail. IMO
if there's cases where users might benefit from seeing the non-finalized
values, it'd be better to expose them through postaggregators that have well
defined semantics (like sketch_bytes_as_base64 or something).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]