LATEST_BY for STRING columns without specifying maxStringBytes (druid)

via GitHub Thu, 17 Aug 2023 03:28:02 -0700


clintropolis commented on code in PR #14848:
URL: https://github.com/apache/druid/pull/14848#discussion_r1297023754



##########
docs/querying/sql-aggregations.md:
##########
@@ -82,13 +82,13 @@ In the aggregation functions supported by Druid, only 
`COUNT`, `ARRAY_AGG`, and
 |`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See 
[stats extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. 
If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column. If the earliest non-null value 
of the timestamp column appears in multiple rows, the `expr` may be taken from 
any of those rows. If `expr` does not come from a relation with a timestamp, 
then it is simply the first value encountered.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`EARLIEST(expr)`|Returns the earliest value of `expr`. If `expr` comes from a 
relation with a timestamp column (like `__time` in a Druid datasource), the 
"earliest" is taken from the row with the overall earliest non-null value of 
the timestamp column. If the earliest non-null value of the timestamp column 
appears in multiple rows, the `expr` may be taken from any of those rows. If 
`expr` does not come from a relation with a timestamp, then it is simply the 
first value encountered. For string/complex columns this is equivalent of 
calling `EARLIEST(expr, 1024)`.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `''`|

Review Comment:
   i think instead of making the number docs talk about strings, it would be 
nicer to adjust the string docs to mark the argument as optional and indicate 
what the default value is there, so it becomes 
   ```
   |`EARLIEST(expr, [maxBytesPerString])`| Like `EARLIEST(expr)`, but for 
strings. The optional `maxBytesPerString` ... 
   ```



##########
docs/querying/sql-aggregations.md:
##########
@@ -82,13 +82,13 @@ In the aggregation functions supported by Druid, only 
`COUNT`, `ARRAY_AGG`, and
 |`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See 
[stats extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. 
If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column. If the earliest non-null value 
of the timestamp column appears in multiple rows, the `expr` may be taken from 
any of those rows. If `expr` does not come from a relation with a timestamp, 
then it is simply the first value encountered.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`EARLIEST(expr)`|Returns the earliest value of `expr`. If `expr` comes from a 
relation with a timestamp column (like `__time` in a Druid datasource), the 
"earliest" is taken from the row with the overall earliest non-null value of 
the timestamp column. If the earliest non-null value of the timestamp column 
appears in multiple rows, the `expr` may be taken from any of those rows. If 
`expr` does not come from a relation with a timestamp, then it is simply the 
first value encountered. For string/complex columns this is equivalent of 
calling `EARLIEST(expr, 1024)`.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
 |`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `''`|
-|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, 
which must be numeric. The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
+|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`. The 
earliest value of `expr` is taken from the row with the overall earliest 
non-null value of `timestampExpr`. If the earliest non-null value of 
`timestampExpr` appears in multiple rows, the `expr` may be taken from any of 
those rows. For string/complex columns this is equivalent of calling 
`EARLIEST_BY(expr, timestampExpr, 1024)`.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|

Review Comment:
   same comment about instead of modifying this to modify the string docs to 
mark the `maxBytesPerString` parameter as optional.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (druid)

Reply via email to