paul-rogers commented on PR #13187: URL: https://github.com/apache/druid/pull/13187#issuecomment-1276979486
There was some offline discussion about isomorphisms between this PR and the existing design. For the benefit of other readers, here is the mapping per the _original_ concept of query runners:

| Existing | This PR | Description |
| -------- | ------- | ----------- |
| Lots of code | Query planners | Decides what is to be done for a given query. |
| `QueryRunner` | `Operator` | Does one task in a query pipeline & returns results. |
| `Sequence` | `ResultIterator` | Mechanism to obtain the results. |

Given how the code evolved into its current state:

| Existing | This PR | Description |
| -------- | ------- | ----------- |
| `QueryRunner` | Query planners | Decides what is to be done for a given query. |
| `Sequence` | `Operator` | Does one task in a query pipeline & returns results. |
| `Sequence` | `ResultIterator` | Mechanism to obtain the results. |

The key notion is that, in the present PR, we separate the task of "what to do" from "go do it", while in the existing code these two are often combined. A key reason to split the tasks is to improve testability and reusability. Operators that don't decide what to do, but just do one thing well, can be more easily composed into a large variety of query shapes. `QueryRunner`s and `Sequence`s may have started out this way, but today they tend to be tightly coupled to their context and to one another.

One other difference is that a `Sequence` is reluctant to provide its contents: it wants to do the aggregation for its "downstream" consumer. This couples the implementations of adjacent `Sequence`s: the upstream one has to be able to implement what the downstream one needs. `Yielder`s can coerce a `Sequence` into coughing up individual rows, which is what often happens in practice.

By contrast, the `Operator` abstraction makes a sharper split: an `Operator` produces a result (usually a batch) and has no desire to know what the downstream operator does with those results. Similarly, a downstream operator says, "just show me the data, baby!"
It doesn't care how the batch of rows was produced. The usual arguments apply for testability, modularity and reusability.

Perhaps the goal of a `Sequence` was to avoid transferring any more data than necessary: transfer only the aggregates. This is ideal across the network. In memory, there is no "transfer cost", just a pointer changing hands, so it does not matter which side of the line the aggregation is done on. (With the obvious exception of pushing things down into segments whenever possible.) This lets aggregators be aggregators, and other operators just do their own jobs, without responsibilities smearing across boundaries. For the network case, yes, put the aggregate operator on the sender side of the exchange. But it's still just an operator.

One other thought: I can't take the credit (or blame) for this idea or naming. The "operator" name comes from "relational operator" in the relational algebra that Codd invented way back when. The operator structure has been around since at least the Volcano paper. All we're doing here is borrowing good ideas so we don't have to reinvent the wheel.
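To make the "what to do" versus "go do it" split a bit more concrete, here is a minimal sketch in Java. It is not the code in this PR: the `Operator` and `ResultIterator` interfaces below are simplified, hypothetical stand-ins (row-at-a-time rather than batches), and `ListSource`, `Filter`, `Limit`, `plan` and `run` are names invented purely for illustration. The point is only that each operator does one thing and knows nothing about its neighbors, while a separate planning step decides how to stack them.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

final class OperatorSketch {
    /** Hypothetical, simplified mechanism to obtain results. */
    interface ResultIterator<T> {
        boolean hasNext();
        T next();
    }

    /** Hypothetical, simplified operator: does one task in a pipeline. */
    interface Operator<T> {
        ResultIterator<T> open();
        void close();
    }

    /** Leaf operator: hands out rows from an in-memory list. */
    static final class ListSource<T> implements Operator<T> {
        private final List<T> rows;
        ListSource(List<T> rows) { this.rows = rows; }
        @Override public ResultIterator<T> open() {
            final Iterator<T> it = rows.iterator();
            return new ResultIterator<T>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public T next() { return it.next(); }
            };
        }
        @Override public void close() { }
    }

    /** Passes along only the rows that match; no idea who produced them. */
    static final class Filter<T> implements Operator<T> {
        private final Operator<T> input;
        private final Predicate<T> predicate;
        Filter(Operator<T> input, Predicate<T> predicate) {
            this.input = input;
            this.predicate = predicate;
        }
        @Override public ResultIterator<T> open() {
            final ResultIterator<T> in = input.open();
            return new ResultIterator<T>() {
                private T pending;
                private boolean ready;
                @Override public boolean hasNext() {
                    // Pull from upstream until a matching row is buffered.
                    while (!ready && in.hasNext()) {
                        T row = in.next();
                        if (predicate.test(row)) { pending = row; ready = true; }
                    }
                    return ready;
                }
                @Override public T next() {
                    if (!hasNext()) { throw new NoSuchElementException(); }
                    ready = false;
                    return pending;
                }
            };
        }
        @Override public void close() { input.close(); }
    }

    /** Stops after a fixed number of rows. */
    static final class Limit<T> implements Operator<T> {
        private final Operator<T> input;
        private final int limit;
        Limit(Operator<T> input, int limit) { this.input = input; this.limit = limit; }
        @Override public ResultIterator<T> open() {
            final ResultIterator<T> in = input.open();
            return new ResultIterator<T>() {
                private int emitted;
                @Override public boolean hasNext() { return emitted < limit && in.hasNext(); }
                @Override public T next() {
                    if (!hasNext()) { throw new NoSuchElementException(); }
                    emitted++;
                    return in.next();
                }
            };
        }
        @Override public void close() { input.close(); }
    }

    /** "What to do": the planner only decides how operators are stacked. */
    static Operator<Integer> plan(List<Integer> rows, int limit) {
        return new Limit<>(new Filter<>(new ListSource<>(rows), x -> x % 2 == 0), limit);
    }

    /** "Go do it": drain the root operator, then close the pipeline. */
    static <T> List<T> run(Operator<T> root) {
        List<T> out = new ArrayList<>();
        try {
            ResultIterator<T> results = root.open();
            while (results.hasNext()) { out.add(results.next()); }
        } finally {
            root.close();
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run(plan(Arrays.asList(1, 2, 3, 4, 5, 6), 2))); // prints [2, 4]
    }
}
```

Because `Filter` and `Limit` see only a `ResultIterator`, each can be unit-tested against a `ListSource` with no other context, and an aggregate operator could be slotted in on either side of an exchange without the neighboring operators changing.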
