[PR] Refine EARLIEST/LATEST aggregators (druid)

via GitHub Thu, 11 Jan 2024 13:41:38 -0800


LakshSingla opened a new pull request, #15670:
URL: https://github.com/apache/druid/pull/15670


   ### Description
   
   This PR:
   * Enables MSQ tests on queries using the EARLIEST/LATEST/EARLIEST_BY/LATEST_BY aggregators.
   * Removes the limitations in the docs, since numeric first/last can now be used with MSQ and at ingestion time.
   * Disallows EARLIEST_BY and LATEST_BY on rolled-up pairLongObjects. This prevents the caller from supplying a timeExpr that the native engine would silently ignore, which might be unexpected behavior. The correct way to further aggregate such columns is EARLIEST/LATEST, where the caller understands that the time is taken implicitly from the rolled-up metric. For example:
   ```sql
   -- Insert data into a table using the following query. 'finalize' should be
   -- false in the query context to enable rollup.
   INSERT INTO foo SELECT dim1, EARLIEST_BY(m1, timestampCol1) AS m1 FROM EXTERN(...) GROUP BY dim1
   
   -- Roll up the pre-aggregated metric with a different timestamp column.
   -- In such a case, the native aggregator would ignore the value of timestampCol2
   -- and use the timestamp that was aggregated during ingestion. To prevent such
   -- errors, the call is disallowed and a user-friendly message is thrown.
   SELECT EARLIEST_BY(m1, timestampCol2) FROM foo
     
   -- Roll up using the timestamp that was recorded at ingestion time.
   SELECT EARLIEST(m1) FROM foo
   ```
   
   First/Last aggregators call `.toString()` on array types and on complex metrics that aren't of a pair type (pairLongLong, pairLongString...). This is also odd, but it hasn't been changed because it has been supported for a long time and is implicitly documented.
   
   Disallowing `EARLIEST_BY(aggregatedMetric, timestampCol2)` will require users to change their queries. However, the equivalent call is `EARLIEST(aggregatedMetric)`, which is much clearer because a column that the user typed explicitly is no longer silently ignored.
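
The rule described above can be sketched roughly as follows. This is only an illustration of the intended behavior, not the code in this PR: the class name, method name, and error text are hypothetical, and the set of pair type names is an assumption based on the types mentioned in this description.

```java
import java.util.Set;

// Hypothetical sketch; none of these names are Druid's actual classes or methods.
final class EarliestLatestByCheck {
    // Assumed names of the complex types that first/last aggregators produce at ingestion time.
    static final Set<String> ROLLED_UP_PAIR_TYPES = Set.of("pairLongLong", "pairLongString");

    // Rejects the *_BY variants on rolled-up pair columns, since the supplied
    // time expression would be ignored in favor of the timestamp stored in the pair.
    static void validate(String functionName, String operandType) {
        boolean isByVariant = functionName.endsWith("_BY");
        if (isByVariant && ROLLED_UP_PAIR_TYPES.contains(operandType)) {
            String replacement = functionName.substring(0, functionName.length() - "_BY".length());
            throw new IllegalArgumentException(
                functionName + " cannot be used on a column of type " + operandType
                    + "; use " + replacement + "(column) instead");
        }
    }

    public static void main(String[] args) {
        validate("EARLIEST", "pairLongString"); // allowed: time comes from the rolled-up metric
        validate("EARLIEST_BY", "double");      // allowed: plain numeric column
        try {
            validate("EARLIEST_BY", "pairLongLong");
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Failing early with a message that names the replacement function is what makes the migration from `EARLIEST_BY(aggregatedMetric, timestampCol2)` to `EARLIEST(aggregatedMetric)` straightforward for the user.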
   
   #### Release note
   EARLIEST_BY and LATEST_BY can no longer be used on complex columns that were created during ingestion (with rollup) by the first/last aggregators. Use EARLIEST and LATEST on such columns instead.
   
   
   
   <hr>
   
   ##### Key changed/added classes in this PR
   
   This PR has:
   
   - [ ] been self-reviewed.
      - [ ] using the [concurrency checklist](https://github.com/apache/druid/blob/master/dev/code-review/concurrency.md) (Remove this item if the PR doesn't have any relation to concurrency.)
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] a release note entry in the PR description.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
   - [ ] added or updated version, license, or notice information in [licenses.yaml](https://github.com/apache/druid/blob/master/dev/license.md)
   - [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
   - [ ] added integration tests.
   - [ ] been tested in a test Druid cluster.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

