Re: [PR] Disallow the EARLIEST_BY/LATEST_BY aggregators from aggregating on complex long(Type) pair objects (druid)

via GitHub Mon, 29 Jan 2024 11:15:54 -0800


vtlim commented on code in PR #15670:
URL: https://github.com/apache/druid/pull/15670#discussion_r1470082875



##########
docs/querying/sql-aggregations.md:
##########
@@ -87,9 +87,9 @@ In the aggregation functions supported by Druid, only 
`COUNT`, `ARRAY_AGG`, and
 |`STDDEV_SAMP(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`STDDEV(expr)`|Computes standard deviation sample of `expr`. See [stats 
extension](../development/extensions-core/stats.md) documentation for 
additional details.|`null` or `0` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
 |`EARLIEST(expr, [maxBytesPerValue])`|Returns the earliest value of `expr`.<br 
/>If `expr` comes from a relation with a timestamp column (like `__time` in a 
Druid datasource), the "earliest" is taken from the row with the overall 
earliest non-null value of the timestamp column.<br />If the earliest non-null 
value of the timestamp column appears in multiple rows, the `expr` may be taken 
from any of those rows. If `expr` does not come from a relation with a 
timestamp, then it is simply the first value encountered.<br /><br />If `expr` 
is a string or complex type `maxBytesPerValue` amount of space is allocated for 
the aggregation. Strings longer than this limit are truncated. The  
`maxBytesPerValue` parameter should be set as low as possible, since high 
values will lead to wasted memory.<br/>If `maxBytesPerValue`is omitted; it 
defaults to `1024`. |`null` or `0`/`''` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
-|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest 
value of `expr`.<br />The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. <br />If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.<br /><br />If `expr` is a string or complex type 
`maxBytesPerValue` amount of space is allocated for the aggregation. Strings 
longer than this limit are truncated. The  `maxBytesPerValue` parameter should 
be set as low as possible, since high values will lead to wasted memory.<br/>If 
`maxBytesPerValue`is omitted; it defaults to `1024`. |`null` or `0`/`''` if 
`druid.generic.useDefaultValueForNull=true` (legacy mode)|
+|`EARLIEST_BY(expr, timestampExpr, [maxBytesPerValue])`|Returns the earliest 
value of `expr`.<br />The earliest value of `expr` is taken from the row with 
the overall earliest non-null value of `timestampExpr`. <br />If the earliest 
non-null value of `timestampExpr` appears in multiple rows, the `expr` may be 
taken from any of those rows.<br /><br />If `expr` is a string or complex type 
`maxBytesPerValue` amount of space is allocated for the aggregation. Strings 
longer than this limit are truncated. The  `maxBytesPerValue` parameter should 
be set as low as possible, since high values will lead to wasted memory.<br/>If 
`maxBytesPerValue`is omitted; it defaults to `1024`. It cannot be used with the 
rollup tables created with any variant of the 
EARLIEST/LATEST/EARLIEST_BY/LATEST_BY aggregator since the intermediate type 
will already contain the timestamp and the one passed explicitly will get 
ignored by the native engine. USE `EARLIEST` in such scenario. |`null` or 
`0`/`''` if `druid.g
 eneric.useDefaultValueForNull=true` (legacy mode)|

Review Comment:
   Suggesting rephrasing the new additions as follows:
   
   ```
   <br /<br />Use `EARLIEST` instead of `EARLIEST_BY` on a table that has 
rollup enabled and was created with any variant of `EARLIEST`, `LATEST`, 
`EARLIEST_BY`, or `LATEST_BY`. In these cases, the intermediate type already 
stores the timestamp, and Druid ignores the value passed in `timestampExpr`.
   ```
   
   
   * Is the behavior different between the native engine and multi-stage query 
engine? If yes, we should mention what MSQ does. If not, we can remove "by the 
native engine" from this description.
   * If Druid ignores the explicitly set timestamp, is EARLIEST_BY equivalent 
to using EARLIEST, or would it fail?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Disallow the EARLIEST_BY/LATEST_BY aggregators from aggregating on complex long(Type) pair objects (druid)

Reply via email to