[PR] Allow EARLIEST/EARLIEST_BY/LATEST/LATEST_BY for STRING columns without specifying maxStringBytes (druid)

via GitHub Tue, 22 Aug 2023 04:07:57 -0700


kgyrtkirk commented on code in PR #14848:
URL: https://github.com/apache/druid/pull/14848#discussion_r1301478602



##########
docs/querying/sql-aggregations.md:
##########
@@ -86,16 +86,11 @@ In the aggregation functions supported by Druid, only 
`COUNT`, `ARRAY_AGG`, and
 |`STDDEV_POP(expr)`|Computes standard deviation population of `expr`. See 
[stats extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
-|`EARLIEST(expr)`|Returns the earliest value of `expr`, which must be numeric. 
If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column. If the earliest non-null value 
of the timestamp column appears in multiple rows, the `expr` may be taken from 
any of those rows. If `expr` does not come from a relation with a timestamp, 
then it is simply the first value encountered.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
-|`EARLIEST(expr, maxBytesPerString)`|Like `EARLIEST(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `''`|
-|`EARLIEST_BY(expr, timestampExpr)`|Returns the earliest value of `expr`, 
which must be numeric. The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
-|`EARLIEST_BY(expr, timestampExpr, maxBytesPerString)`| Like 
`EARLIEST_BY(expr, timestampExpr)`, but for strings. The `maxBytesPerString` 
parameter determines how much aggregation space to allocate per string. Strings 
longer than this limit are truncated. This parameter should be set as low as 
possible, since high values will lead to wasted memory.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `''`|
-|`LATEST(expr)`|Returns the latest value of `expr`, which must be numeric. The 
`expr` must come from a relation with a timestamp column (like `__time` in a 
Druid datasource) and the "latest" is taken from the row with the overall 
latest non-null value of the timestamp column. If the latest non-null value of 
the timestamp column appears in multiple rows, the `expr` may be taken from any 
of those rows. |`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0`|
-|`LATEST(expr, maxBytesPerString)`|Like `LATEST(expr)`, but for strings. The 
`maxBytesPerString` parameter determines how much aggregation space to allocate 
per string. Strings longer than this limit are truncated. This parameter should 
be set as low as possible, since high values will lead to wasted memory.|`null` 
if `druid.generic.useDefaultValueForNull=false`, otherwise `''`|
-|`LATEST_BY(expr, timestampExpr)`|Returns the latest value of `expr`, which 
must be numeric. The latest value of `expr` is taken from the row with the 
overall latest non-null value of `timestampExpr`. If the overall latest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
-|`LATEST_BY(expr, timestampExpr, maxBytesPerString)`|Like `LATEST_BY(expr, 
timestampExpr)`, but for strings. The `maxBytesPerString` parameter determines 
how much aggregation space to allocate per string. Strings longer than this 
limit are truncated. This parameter should be set as low as possible, since 
high values will lead to wasted memory.|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `''`|
-|`ANY_VALUE(expr)`|Returns any value of `expr` including null. `expr` must be 
numeric. This aggregator can simplify and optimize the performance by returning 
the first encountered value (including null)|`null` if 
`druid.generic.useDefaultValueForNull=false`, otherwise `0`|
-|`ANY_VALUE(expr, maxBytesPerString)`|Like `ANY_VALUE(expr)`, but for strings. 
The `maxBytesPerString` parameter determines how much aggregation space to 
allocate per string. Strings longer than this limit are truncated. This 
parameter should be set as low as possible, since high values will lead to 
wasted memory.|`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `''`|
+|`EARLIEST(expr, [maxBytesPerString])`|Returns the earliest value of 
`expr`.<br />If `expr` comes from a relation with a timestamp column (like 
`__time` in a Druid datasource), the "earliest" is taken from the row with the 
overall earliest non-null value of the timestamp column.<br />If the earliest 
non-null value of the timestamp column appears in multiple rows, the `expr` may 
be taken from any of those rows. If `expr` does not come from a relation with a 
timestamp, then it is simply the first value encountered.<br /><br />If `expr` 
is a string or complex type `maxBytesPerString` amount of space is allocated 
for the aggregation. Strings longer than this limit are truncated. The  
`maxBytesPerString` parameter should be set as low as possible, since high 
values will lead to wasted memory.<br/>If `maxBytesPerString`is omitted; it 
defaults to `1024`. |`null` if `druid.generic.useDefaultValueForNull=false`, 
otherwise `0` or `''` (depending on the type of `expr`)|

Review Comment:
   I was thinking of also replacing `maxBytesPerString` everywhere in the main 
codebase, but it seems this `maxBytesPerString` may also appear in the 
`native` API, so I decided against that:
   https://druid.apache.org/docs/latest/querying/aggregations/
   
   I believe a change like that may cause some trouble, so I wonder which 
path we should take:
   * (A) rename those as well
     * I would be worried that doing this would confuse users and native API 
clients using the system
   * (B) leave them alone
     * the current state of the PR
   * (C) undo this rename
     * so that the native API stays in line with the docs on this
   * (X) something else? :D
   
   


