[GitHub] [druid] paul-rogers commented on pull request #13187: Convert two native queries to use operators

GitBox Wed, 12 Oct 2022 20:30:34 -0700


paul-rogers commented on PR #13187:
URL: https://github.com/apache/druid/pull/13187#issuecomment-1276979486


   There was some offline discussion about isomorphisms between this PR and the 
existing design. For the benefit of other readers, here is the mapping per the 
_original_ concept of query runners:
   
   | Existing | This PR | Description |
   | -------- | ------- | ---------- |
   | Lots of code | Query planners | Decides what is to be done for a given query. |
   | `QueryRunner` | `Operator` | Does one task in a query pipeline & returns results. |
   | `Sequence` | `ResultIterator` | Mechanism to obtain the results. |
   
   Given how the code evolved into its current state:
   
   | Existing | This PR | Description |
   | -------- | ------- | ---------- |
   | `QueryRunner` | Query planners | Decides what is to be done for a given query. |
   | `Sequence` | `Operator` | Does one task in a query pipeline & returns results. |
   | `Sequence` | `ResultIterator` | Mechanism to obtain the results. |
   
   The key notion is that, in the present PR, we separate the task of "what to 
do" from "go do it", while in the existing code, these two are often combined. 
A key reason to split the tasks is to improve testability and reusability. 
Operators that don't decide what to do, but just do one thing well, can be more 
easily composed into a large variety of query shapes. `QueryRunner`s and 
`Sequence`s may have started out this way, but today they tend to be tightly 
coupled to their context and to one another.
   
   One other difference is that a `Sequence` is reluctant to provide its 
contents: it wants to do the aggregation for its "downstream" consumer. This 
couples the implementation of the adjacent `Sequence`s: the upstream one has to 
be able to implement what the downstream needs. `Yielder`s can coerce a 
`Sequence` into coughing up individual rows, which is what often happens in 
practice.
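
   To make the coupling concrete, here is a minimal sketch of a push-style 
sequence. `PushSequence`, `fromList`, `sum` and `toRows` are hypothetical names 
made up for illustration; this is a simplification, not Druid's actual 
`Sequence` or `Yielder` API. The consumer cannot simply ask for rows: it has to 
phrase its work as a fold that the upstream side runs on its behalf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.BiFunction;

public class PushSequenceSketch {
    /** Hypothetical push-style sequence: the consumer hands over a fold function. */
    interface PushSequence<T> {
        <R> R accumulate(R initial, BiFunction<R, T, R> accumulator);
    }

    /** Builds a sequence over an in-memory list; the sequence drives the loop. */
    static <T> PushSequence<T> fromList(List<T> rows) {
        return new PushSequence<T>() {
            @Override
            public <R> R accumulate(R initial, BiFunction<R, T, R> accumulator) {
                R acc = initial;
                for (T row : rows) {
                    acc = accumulator.apply(acc, row);
                }
                return acc;
            }
        };
    }

    /** Aggregation fits the fold shape naturally. */
    static int sum(PushSequence<Integer> seq) {
        return seq.accumulate(0, (acc, row) -> acc + row);
    }

    /** Getting at plain rows means writing a fold that merely collects them. */
    static <T> List<T> toRows(PushSequence<T> seq) {
        return seq.accumulate(new ArrayList<T>(), (list, row) -> {
            list.add(row);
            return list;
        });
    }

    public static void main(String[] args) {
        PushSequence<Integer> seq = fromList(Arrays.asList(1, 2, 3));
        System.out.println(sum(seq));    // prints 6
        System.out.println(toRows(seq)); // prints [1, 2, 3]
    }
}
```

   In this sketch the upstream side owns the loop, so every consumer must fit 
the fold shape even when all it wants is the rows themselves.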
   
   By contrast, the `Operator` abstraction makes a sharper split: an `Operator` 
produces a result (usually a batch) and has no desire to know what the 
downstream operator does with those results. Similarly, a downstream operator 
says, "just show me the data, baby!" It doesn't care how the batch of rows was 
produced. Usual arguments apply for testability, modularity and reusability.
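
   A minimal sketch of that split, for readers who have not seen the pattern. 
The interfaces below are hypothetical simplifications written for this 
comment: the names `Operator` and `ResultIterator` come from the PR, but the 
method signatures, `ListOperator`, `FilterOperator` and `drain` are 
assumptions, not the PR's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

public class OperatorSketch {
    /** Hypothetical pull-style iterator; signatures are assumed, not the PR's. */
    interface ResultIterator<T> {
        boolean hasNext();
        T next();
    }

    /** Hypothetical operator: does one task and exposes its results for pulling. */
    interface Operator<T> {
        ResultIterator<T> open();
        void close();
    }

    /** Leaf operator that serves rows from an in-memory list. */
    static class ListOperator<T> implements Operator<T> {
        private final List<T> rows;
        ListOperator(List<T> rows) { this.rows = rows; }
        @Override public ResultIterator<T> open() {
            Iterator<T> it = rows.iterator();
            return new ResultIterator<T>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public T next() { return it.next(); }
            };
        }
        @Override public void close() { }
    }

    /** Filter operator: it neither knows nor cares how its input rows were produced. */
    static class FilterOperator<T> implements Operator<T> {
        private final Operator<T> input;
        private final Predicate<T> predicate;
        FilterOperator(Operator<T> input, Predicate<T> predicate) {
            this.input = input;
            this.predicate = predicate;
        }
        @Override public ResultIterator<T> open() {
            ResultIterator<T> in = input.open();
            return new ResultIterator<T>() {
                private T pending;
                private boolean ready;
                @Override public boolean hasNext() {
                    // Pull from upstream until a row passes the predicate.
                    while (!ready && in.hasNext()) {
                        T row = in.next();
                        if (predicate.test(row)) {
                            pending = row;
                            ready = true;
                        }
                    }
                    return ready;
                }
                @Override public T next() {
                    if (!hasNext()) {
                        throw new NoSuchElementException();
                    }
                    ready = false;
                    return pending;
                }
            };
        }
        @Override public void close() { input.close(); }
    }

    /** The consumer pulls rows at its own pace and decides what to do with them. */
    static <T> List<T> drain(Operator<T> op) {
        List<T> out = new ArrayList<>();
        ResultIterator<T> it = op.open();
        while (it.hasNext()) {
            out.add(it.next());
        }
        op.close();
        return out;
    }

    public static void main(String[] args) {
        Operator<Integer> scan = new ListOperator<>(Arrays.asList(1, 2, 3, 4, 5, 6));
        Operator<Integer> evens = new FilterOperator<>(scan, n -> n % 2 == 0);
        System.out.println(drain(evens)); // prints [2, 4, 6]
    }
}
```

   Each operator here can be unit-tested with a `ListOperator` as its input, 
which is the testability argument in miniature.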
   
   Perhaps the goal of a `Sequence` was to avoid transferring any more data 
than necessary: transfer only the aggregates. This is ideal across the network. 
In memory, by contrast, there is no "transfer cost", just a pointer changing hands. So, it 
does not matter which side of the line the aggregation is done on. (With the 
obvious exception of pushing things down into segments whenever possible.) This 
lets aggregators be aggregators, and other operators just do their own jobs, 
without responsibilities smearing across boundaries. For the network case, yes, 
put the aggregate operator on the sender side of the exchange. But, it's still 
just an operator.
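
   As a rough illustration of the placement argument, the sketch below uses 
one aggregate operator on either side of a stand-in "exchange" that merely 
counts the rows passing through it. Everything here is hypothetical: the 
one-method `Operator` interface uses `java.util.Iterator` in place of 
`ResultIterator` to stay short, and `scan`, `sum` and `exchange` are 
illustrative helpers, not code from the PR.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

public class AggregatePlacementSketch {
    /** Hypothetical operator; java.util.Iterator stands in for ResultIterator. */
    interface Operator<T> {
        Iterator<T> open();
    }

    /** Leaf operator over in-memory rows. */
    static Operator<Long> scan(List<Long> rows) {
        return rows::iterator;
    }

    /** Aggregate operator: consumes its whole input and emits a single total. */
    static Operator<Long> sum(Operator<Long> input) {
        return () -> {
            long total = 0;
            Iterator<Long> it = input.open();
            while (it.hasNext()) {
                total += it.next();
            }
            return Collections.singletonList(total).iterator();
        };
    }

    /** Stand-in for a network exchange: passes rows through and counts them. */
    static Operator<Long> exchange(Operator<Long> sender, AtomicInteger rowsSent) {
        return () -> {
            Iterator<Long> it = sender.open();
            return new Iterator<Long>() {
                @Override public boolean hasNext() { return it.hasNext(); }
                @Override public Long next() {
                    rowsSent.incrementAndGet();
                    return it.next();
                }
            };
        };
    }

    public static void main(String[] args) {
        List<Long> segment = Arrays.asList(3L, 4L, 5L);

        // Aggregate on the receiver side: every row crosses the exchange.
        AtomicInteger sentRaw = new AtomicInteger();
        long a = sum(exchange(scan(segment), sentRaw)).open().next();

        // Aggregate on the sender side: only the total crosses the exchange.
        AtomicInteger sentAgg = new AtomicInteger();
        long b = exchange(sum(scan(segment)), sentAgg).open().next();

        System.out.println(a + " after " + sentRaw.get() + " rows sent"); // 12 after 3 rows sent
        System.out.println(b + " after " + sentAgg.get() + " rows sent"); // 12 after 1 rows sent
    }
}
```

   The `sum` operator is identical in both pipelines; only its position 
relative to the exchange changes.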
   
   One other thought: I can't take the credit (or blame) for this idea or 
naming. The "operator" name comes from "relational operator" in the relational 
algebra that Codd introduced way back when. The operator structure has been 
around since at least the Volcano paper. All we're doing here is borrowing good 
ideas so we don't have to reinvent the wheel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

