imply-cheddar opened a new pull request, #13416:
URL: https://github.com/apache/druid/pull/13416

   ### Description
   
   This commit adds support for window functions to Druid.  The intent is to 
support window functions similarly to what PostgreSQL does 
(https://www.postgresql.org/docs/current/functions-window.html) as well as 
Drill (https://drill.apache.org/docs/sql-window-functions-introduction/).
   
   The commit is large and there is more work to do, but this lays the 
foundation for supporting window functions.  This PR is best read from a few 
entry points:
   
   1. The windowing functions are implemented as `Processor`s.  Looking at this 
interface will give an idea of how the actual window functions are implemented. 
 The key notion here is that `Processor`s deal with a `RowsAndColumns` 
object that represents a full partition.  Generally speaking, they assume that 
something else has prepared the `RowsAndColumns` for them, which proves to be a 
nice simplifying assumption for their implementation.
   2. `RowsAndColumns` is an interface that represents a set of... rows and 
columns.  If you look at the interface, it is rather minimalist.  The idea here 
is to lean into the idea of `.as()` as has existed on Segments for a long time. 
 We use it to effectively build a "menu" of common, generic functionality that 
can be done on a batch of data and then Operators/Processors can be written in 
terms of that common, generic functionality.  The implementation in this PR 
provides naive implementations of this functionality.  As we get deeper in 
future PRs, this functionality will get fleshed out and specialized more to 
avoid object copies and megamorphism while further offering love for vectorized 
processing.
   3. `Operator` is an interface introduced here.  `WindowOperatorQuery` is an 
operator-defined query for running Window operations on top of the results of a 
sub-query.  All of the logic for the operators is handled in 
`WindowOperatorQueryQueryToolChest.merge` right now.  Essentially, this PR has 
been co-opted to get Operators introduced into the Druid code base.  This means 
we are using window queries as an initial jumping off point for `Operators` to 
be introduced into the code flow.  We expect more and more iteration on this to 
expand the capabilities deeper and perhaps make a future world where 
Operator-only queries are a thing that Druid supports.
   4. `CalciteWindowQueryTest` leverages the recent changes that brought us the 
`QueryTestBuilder` to have fully file-driven tests.  For the window functions, 
we needed more data, so I've moved the wikipedia dataset that was checked in as 
part of `environment` to be in the `resources` of the test jar.  We then index 
that and reference it in the tests.  The tests sit in `calcite/tests/window` 
under resources.  Each file contains a SQL query, the window Operator structure 
that we expect to be built from it, and the expected results.  Hopefully, this 
will simplify the addition of test cases, making it easier for someone to add a 
test without necessarily knowing how to work with the code and fix it.
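
To make points 1 and 2 more concrete, here is a minimal sketch of the pattern they describe. It is not the code in this PR: `Batch`, `LongColumnView`, `BatchProcessor`, `MapBatch`, `AppendedColumn`, and `RunningSum` are hypothetical stand-ins for `RowsAndColumns`, a capability obtained through `.as()`, and `Processor`, and the real interfaces may look quite different. The sketch only shows a processor that receives one already-prepared partition, asks it for a capability via `as()`, and returns the partition with one extra column.

```java
import java.util.HashMap;
import java.util.Map;

public class ProcessorSketch {
    /** Hypothetical stand-in for RowsAndColumns: one fully prepared partition. */
    interface Batch {
        int numRows();

        /** Returns the requested capability, or null if this batch cannot provide it. */
        <T> T as(Class<T> clazz);
    }

    /** Hypothetical capability from the "menu": read a column as primitive longs. */
    interface LongColumnView {
        long[] longs(String column);
    }

    /** Hypothetical stand-in for Processor: takes a whole partition, returns a new one. */
    interface BatchProcessor {
        Batch process(Batch partition);
    }

    /** Naive map-backed batch, in the spirit of a first, unspecialized implementation. */
    static final class MapBatch implements Batch, LongColumnView {
        private final int numRows;
        private final Map<String, long[]> columns;

        MapBatch(int numRows, Map<String, long[]> columns) {
            this.numRows = numRows;
            this.columns = columns;
        }

        @Override public int numRows() { return numRows; }

        @Override public <T> T as(Class<T> clazz) {
            return clazz.isInstance(this) ? clazz.cast(this) : null;
        }

        @Override public long[] longs(String column) { return columns.get(column); }
    }

    /** Decorates a batch with one extra column without copying the existing ones. */
    static final class AppendedColumn implements Batch, LongColumnView {
        private final Batch base;
        private final String name;
        private final long[] values;

        AppendedColumn(Batch base, String name, long[] values) {
            this.base = base;
            this.name = name;
            this.values = values;
        }

        @Override public int numRows() { return base.numRows(); }

        @Override public <T> T as(Class<T> clazz) {
            return clazz.isInstance(this) ? clazz.cast(this) : base.as(clazz);
        }

        @Override public long[] longs(String column) {
            if (name.equals(column)) {
                return values;
            }
            LongColumnView view = base.as(LongColumnView.class);
            return view == null ? null : view.longs(column);
        }
    }

    /** A window function as a processor: running sum over the whole partition. */
    static final class RunningSum implements BatchProcessor {
        private final String inputColumn;
        private final String outputColumn;

        RunningSum(String inputColumn, String outputColumn) {
            this.inputColumn = inputColumn;
            this.outputColumn = outputColumn;
        }

        @Override public Batch process(Batch partition) {
            LongColumnView view = partition.as(LongColumnView.class);
            if (view == null) {
                throw new UnsupportedOperationException("partition cannot be read as longs");
            }
            long[] in = view.longs(inputColumn);
            long[] out = new long[partition.numRows()];
            long sum = 0;
            for (int i = 0; i < out.length; i++) {
                sum += in[i];
                out[i] = sum;
            }
            return new AppendedColumn(partition, outputColumn, out);
        }
    }

    /** Convenience entry point: wraps the input as one partition and runs the processor. */
    public static long[] runningSum(long[] input) {
        Map<String, long[]> columns = new HashMap<>();
        columns.put("added", input);
        Batch partition = new MapBatch(input.length, columns);
        Batch result = new RunningSum("added", "runningTotal").process(partition);
        return result.as(LongColumnView.class).longs("runningTotal");
    }
}
```

Calling `ProcessorSketch.runningSum(new long[]{1, 2, 3})` yields `{1, 3, 6}`. Because the processor is written only against the capability it asked for, a more specialized batch implementation could be substituted later without touching the processor.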
   
   This PR introduces various things, but it is not complete yet.  
This is a first step and there are still sharp edges/unimplemented 
functionality.  That said, what exists here does work for a subset of use cases 
and is a meaningful milestone that can be committed while we iterate on 
fleshing out and finishing up the functionality.  As such, I'd like to get this 
reviewed and committed before making it even larger.  All interfaces introduced 
in this PR are experimental and the "windowOperator" query is also intended to 
be experimental.  Given that this is still experimental, I intend to 
merge this PR as an undocumented feature, which we will document better as we 
get the sharp edges resolved.
   
   #### Sharp Edges
   
   1. The window function support does not yet support "frames" (`ROWS BETWEEN 
2 PRECEDING AND 2 FOLLOWING` style clauses).
   2. The support is not yet fully aware of the difference between `RANGE` and 
`ROWS` when evaluating peers.  (The built-in functions are all implemented with 
the correctly defined semantics, similar to what the PostgreSQL documentation 
linked above describes.)
   3. All window functions in one query must use the same windowing definition 
(the code cannot currently support 2 different `PARTITION BY X` clauses)
   4. The windowing logic will not re-sort the data.  It assumes that the 
sub-query was written such that data is pre-sorted in the way that the 
windowing logic expects.
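
Sharp edge 4 is easiest to see with a small sketch. The code below illustrates the general technique of treating each contiguous run of equal partition keys as one partition, with no sorting step. It is an assumption about how such a limitation typically plays out, not the PR's implementation, and the class and method names are hypothetical.

```java
public class ContiguousPartitionSketch {
    /**
     * Computes a ROW_NUMBER()-style value per partition, where a "partition" is
     * any run of adjacent rows that share the same key. The input is never
     * re-sorted, so the caller is responsible for ordering it by the key first.
     */
    public static int[] rowNumbers(String[] partitionKeys) {
        int[] result = new int[partitionKeys.length];
        int counter = 0;
        for (int i = 0; i < partitionKeys.length; i++) {
            // A change in key relative to the previous row starts a new partition.
            if (i == 0 || !partitionKeys[i].equals(partitionKeys[i - 1])) {
                counter = 0;
            }
            result[i] = ++counter;
        }
        return result;
    }
}
```

With keys already sorted, `rowNumbers(new String[]{"a", "a", "b", "b"})` gives `{1, 2, 1, 2}` as expected. With the same rows interleaved, `rowNumbers(new String[]{"a", "b", "a", "b"})` gives `{1, 1, 1, 1}`, because every row looks like the start of a new partition. Under this kind of scheme, the sub-query has to deliver rows ordered by the partitioning columns (and then by the window's ordering columns) for the results to be correct.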
   
   These sharp edges are in the weeds enough that this support should still be 
considered experimental and it should exist as an undocumented feature.  
Subsequent PRs will smooth out these sharp edges.
   
   #### Release note
   None as the intent is for this to be an undocumented, experimental addition.
   
   This PR has:
   
   - [ ] been self-reviewed.
   - [ ] added documentation for new or modified features or behaviors.
   - [ ] added Javadocs for most classes and all non-trivial methods. Linked 
related entities via Javadoc links.
   - [ ] added comments explaining the "why" and the intent of the code 
wherever it would not be obvious to an unfamiliar reader.
   - [ ] added unit tests or modified existing tests to cover new code paths, 
ensuring the threshold for [code 
coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md)
 is met.

