imply-cheddar opened a new pull request, #13416: URL: https://github.com/apache/druid/pull/13416
### Description

This commit adds support for window functions to Druid. The intent is to support window functions similarly to PostgreSQL (https://www.postgresql.org/docs/current/functions-window.html) and Drill (https://drill.apache.org/docs/sql-window-functions-introduction/).

The commit is large and there is more work to do, but it lays the foundation for supporting window functions. This PR is best read from a few entry points:

1. The windowing functions are implemented as `Processor`s. Looking at this interface will give an idea of how the actual window functions are implemented. The biggest notion here is that `Processor`s deal with a `RowsAndColumns` object that represents a full partition. Generally speaking, they assume that something else has prepared the `RowsAndColumns` for them, which proves to be a nice simplifying assumption for their implementation.
2. `RowsAndColumns` is an interface that represents a set of... rows and columns. If you look at the interface, it is rather minimalist. The idea here is to lean into the idea of `.as()` as has existed on Segments for a long time. We use it to effectively build a "menu" of common, generic functionality that can be done on a batch of data, and then Operators/Processors can be written in terms of that common, generic functionality. The implementation in this PR provides naive implementations of this functionality. As we get deeper in future PRs, this functionality will get fleshed out and specialized more to avoid object copies and megamorphism, while also adding better support for vectorized processing.
3. `Operator` is an interface introduced here. `WindowOperatorQuery` is an operator-defined query for running window operations on top of the results of a sub-query. All of the logic for the operators is handled in `WindowOperatorQueryQueryToolChest.merge` right now. Essentially, this PR has been co-opted to get Operators introduced into the Druid code base.
This means we are using window queries as an initial jumping-off point for `Operator`s to be introduced into the code flow. We expect more and more iteration on this to expand the capabilities, and perhaps lead to a future where Operator-only queries are something Druid supports.
4. `CalciteWindowQueryTest` leverages the recent changes that brought us the `QueryTestBuilder` to have fully file-driven tests. For the window functions, we needed more data, so I've moved the wikipedia dataset that was checked in as part of `environment` to be in the `resources` of the test jar. We then index that and reference it in the tests. The tests sit in `calcite/tests/window` of resources. Each file contains a SQL query, the window Operator structure that we expect to be built from it, and the expected results. Hopefully, this will simplify the addition of test cases, making it easier for someone to add a test without necessarily knowing how to work with the code and fix it.

This PR introduces various things, but it is not complete yet. This is a first step and there are still sharp edges and unimplemented functionality. That said, what exists here does work for a subset of use cases and is a meaningful milestone that can be committed while we iterate on fleshing out and finishing up the functionality. As such, I'd like to get this reviewed and committed before making it even larger.

All interfaces introduced in this PR are experimental and the "windowOperator" query is also intended to be experimental. Given that this is still experimental, I intend to merge this PR as an undocumented feature, which we will document better as we get the sharp edges resolved.

#### Sharp Edges

1. The window function support does not yet support "frames" (`ROWS BETWEEN 2 PRECEDING AND 2 FOLLOWING` style clauses).
2. The support is not yet fully aware of the difference between `RANGE` and `ROWS` when evaluating peers.
(The built-in functions are all implemented with the correctly defined semantics, similar to what the PostgreSQL document linked above says.)
3. All window functions in one query must use the same windowing definition (the code cannot currently support 2 different `PARTITION BY X` clauses).
4. The windowing logic will not re-sort the data. It assumes that the sub-query was written such that data is pre-sorted in the way that the windowing logic expects.

These sharp edges are significant enough that this support should still be considered experimental and it should exist as an undocumented feature. Subsequent PRs will smooth out these sharp edges.

#### Release note

None, as the intent is for this to be an undocumented, experimental addition.

This PR has:
- [ ] been self-reviewed.
- [ ] added documentation for new or modified features or behaviors.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added comments explaining the "why" and the intent of the code wherever it would not be obvious to an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for [code coverage](https://github.com/apache/druid/blob/master/dev/code-review/code-coverage.md) is met.
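
#### Illustrative sketch

For readers new to the design, the division of labor described in entry points 1 and 2, together with sharp edge 4, can be sketched roughly as follows. This is only an illustration under stated assumptions and is not the code in this PR: `WindowSketch`, `PartitionProcessor`, `splitPresorted`, `rowNumber`, and `runningSum` are hypothetical names, and a plain `Map<String, List<Object>>` stands in for `RowsAndColumns`. The point is that each processor is handed one complete, already-prepared partition and simply appends a column to it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical stand-ins for illustration only; none of these types are Druid's.
final class WindowSketch {
  private WindowSketch() {}

  /** Receives one complete partition (column name -> values) and returns it with an extra column. */
  interface PartitionProcessor {
    Map<String, List<Object>> process(Map<String, List<Object>> partition);
  }

  /** ROW_NUMBER()-style processor: numbers the rows of the partition 1..n. */
  static PartitionProcessor rowNumber(String outputColumn) {
    return partition -> {
      int numRows = numRows(partition);
      List<Object> out = new ArrayList<>(numRows);
      for (int i = 1; i <= numRows; i++) {
        out.add((long) i);
      }
      return withColumn(partition, outputColumn, out);
    };
  }

  /** Running SUM() over a numeric column, in the order the rows arrive. */
  static PartitionProcessor runningSum(String inputColumn, String outputColumn) {
    return partition -> {
      List<Object> out = new ArrayList<>();
      long total = 0;
      for (Object value : partition.get(inputColumn)) {
        total += ((Number) value).longValue();
        out.add(total);
      }
      return withColumn(partition, outputColumn, out);
    };
  }

  /** Starts a new partition at every change of the key column. It never re-sorts the input. */
  static List<Map<String, List<Object>>> splitPresorted(Map<String, List<Object>> batch, String keyColumn) {
    List<Map<String, List<Object>>> partitions = new ArrayList<>();
    List<Object> keys = batch.get(keyColumn);
    int start = 0;
    for (int i = 1; i <= keys.size(); i++) {
      if (i == keys.size() || !Objects.equals(keys.get(i), keys.get(start))) {
        partitions.add(slice(batch, start, i));
        start = i;
      }
    }
    return partitions;
  }

  /** Copies rows [from, to) of every column into a new batch. */
  private static Map<String, List<Object>> slice(Map<String, List<Object>> batch, int from, int to) {
    Map<String, List<Object>> out = new LinkedHashMap<>();
    batch.forEach((name, values) -> out.put(name, new ArrayList<>(values.subList(from, to))));
    return out;
  }

  /** Returns a shallow copy of the partition with one additional (or replaced) column. */
  private static Map<String, List<Object>> withColumn(
      Map<String, List<Object>> partition, String name, List<Object> values) {
    Map<String, List<Object>> out = new LinkedHashMap<>(partition);
    out.put(name, values);
    return out;
  }

  private static int numRows(Map<String, List<Object>> partition) {
    return partition.isEmpty() ? 0 : partition.values().iterator().next().size();
  }
}
```

In this sketch, a batch whose `country` column reads `A, A, B, B, B` splits into two partitions, while the same keys arriving as `A, B, A` split into three. That is the practical consequence of not re-sorting: the sub-query has to deliver rows already ordered by the partitioning key.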
