paul-rogers commented on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1184982176
> The monolithic native queries will not serve us well in a world where queries and dataflow can get more complex.

Just to provide a bit of context for others reading this, the "monolithic native queries" point is key. Today, Druid *is* a DAG of operations; those operations just tend to be rather large: let's call them "query blocks". We can name three: the native scan query block, the native group-by query block, and a third we'll call the "broker query block". Every Druid query is a DAG of at least two of these:

```text
broker query block
        |
 scan query block
```

Each block does a wide variety of tasks. The scan query block scans segments, pushing filters down through segment selection and into segment indexes and bitmaps. The group-by query block does this as well, but also performs first-stage aggregation. The broker query block is mostly about merging: it distributes work, merges results, handles missing segments, sorts, and applies offsets and limits. Since the broker block sits atop both the scan and group-by blocks, it must handle operations and data formats specific to each of those blocks. That's what the `QueryToolChest` does: it says, "hey, when you do a merge, do it such-and-so way."

The SQL-to-native-query translation step that @gianm mentioned maps the fine-grained Calcite relational operations into groups of operations assigned to native or broker query blocks, by way of defining one or more native queries. Since Druid has only scatter/gather, when a query has more than two blocks, the broker handles all but the leaf blocks. A distributed model would push blocks to other nodes. This is, roughly, what the MSQE controller does: it distributes native query blocks, wrapped in frame processors.

We could build a distributed query using just query blocks. Suppose we wanted to do eleven 10-way merges instead of a single 100-way merge.
We can use two levels of broker blocks:

```text
broker query block   (root, 1 of these)
        |
broker query block   (10 of these)
        |
 scan query block    (100 of these)
```

Query blocks do "just-in-time planning." The broker says, "hey, top-level query runner, execute this query." That query runner says, "well, I know how to limit, so I'll do that, and I'll ask the next query runner to run the rest of the query." The next one knows how to merge but not how to fetch data. The one after that can handle retries, but not distribution. The one below that knows how to distribute, and so on. This is a very clever and unusual approach: it attaches planning to each "operator" and avoids the need for a big Calcite-like planner. But it means we can make decisions based only on local information: we can't, say, choose to reorder a join tree based on metadata. If we need global-level planning, we need a planner/optimizer to do that work. (Though it would be cool if we could keep the local planning where it makes sense.)

So, one model going forward is to keep the existing native query blocks, plan based on those blocks, and distribute them. This is, in fact, what Druid does today. However, as @gianm observed, that "will not serve us well in a world where queries and dataflow can get more complex." The reason is that it becomes increasingly hard to map a Calcite DAG of relational operators onto an efficient set of native query blocks: it is a force-fit. The blocks do more than we need (adding overhead) and limit expressiveness. In short, we're trying to execute the classic DAG of relational operators with query blocks designed for a different use case (the original Druid native queries).

The discussion about operators points out that the native blocks (aside from segment scans) are, in fact, made up of operations implemented as query runners and sequences.
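To make the "just-in-time planning" idea concrete, here is a minimal sketch. The names (`QueryRunner`, `LimitRunner`, etc.) are hypothetical simplifications, not Druid's actual interfaces; each runner handles only the step it understands, then delegates the remainder of the query to the next runner in the chain:

```java
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical, radically simplified "query": a limit plus raw per-segment data.
record Query(int limit, List<List<Integer>> segments) {}

interface QueryRunner {
    List<Integer> run(Query query);
}

// Knows how to apply a limit; delegates everything else downstream.
class LimitRunner implements QueryRunner {
    private final QueryRunner next;
    LimitRunner(QueryRunner next) { this.next = next; }
    public List<Integer> run(Query q) {
        return next.run(q).stream().limit(q.limit()).collect(Collectors.toList());
    }
}

// Knows how to merge (here: sort) per-segment results; delegates scanning.
class MergeRunner implements QueryRunner {
    private final QueryRunner next;
    MergeRunner(QueryRunner next) { this.next = next; }
    public List<Integer> run(Query q) {
        return next.run(q).stream().sorted().collect(Collectors.toList());
    }
}

// Leaf: "scans" the segments by flattening them.
class ScanRunner implements QueryRunner {
    public List<Integer> run(Query q) {
        return q.segments().stream().flatMap(List::stream).collect(Collectors.toList());
    }
}

public class JitPlanningSketch {
    public static void main(String[] args) {
        // Each runner "plans" its own step locally; no global planner exists.
        QueryRunner chain = new LimitRunner(new MergeRunner(new ScanRunner()));
        Query q = new Query(3, List.of(List.of(5, 1), List.of(4, 2)));
        System.out.println(chain.run(q)); // [1, 2, 4]
    }
}
```

Note how each runner sees only its own step: `LimitRunner` cannot know, for example, that pushing the limit below the merge might be cheaper. That is the local-information limitation described above.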
We simply disaggregate those blocks into their component operators, wrap that code in a composable form (operators), and then map more directly from the Calcite logical plan to a distributed physical plan made up of operators. We know this works well because dozens of DBs (including Oracle, SQL Server, Teradata, Spark, and other industry leaders) work this way.

The key question is: for the goals we want to hit, do we actually need the flexibility of the "standard" approach? Or can we achieve our goals, with less cost and risk, by adding a shuffle layer between native query blocks?
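As a rough sketch of the disaggregated model (again with hypothetical names, not Druid's actual operator API): each operator does exactly one thing and exposes the same iterator-style protocol, so a planner can compose them into any physical plan the logical plan calls for:

```java
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Every operator speaks the same protocol, so composition is unconstrained.
interface Operator extends Iterator<Integer> {}

// Leaf operator: yields rows from one "segment".
class Scan implements Operator {
    private final Iterator<Integer> rows;
    Scan(List<Integer> segment) { this.rows = segment.iterator(); }
    public boolean hasNext() { return rows.hasNext(); }
    public Integer next() { return rows.next(); }
}

// Filter operator: passes through only rows matching a predicate.
class Filter implements Operator {
    private final Operator child;
    private final Predicate<Integer> pred;
    private Integer pending;
    Filter(Operator child, Predicate<Integer> pred) { this.child = child; this.pred = pred; }
    public boolean hasNext() {
        while (pending == null && child.hasNext()) {
            Integer row = child.next();
            if (pred.test(row)) pending = row;
        }
        return pending != null;
    }
    public Integer next() {
        if (!hasNext()) throw new NoSuchElementException();
        Integer row = pending;
        pending = null;
        return row;
    }
}

// Limit operator: stops after n rows.
class Limit implements Operator {
    private final Operator child;
    private int remaining;
    Limit(Operator child, int n) { this.child = child; this.remaining = n; }
    public boolean hasNext() { return remaining > 0 && child.hasNext(); }
    public Integer next() { remaining--; return child.next(); }
}

public class OperatorSketch {
    public static void main(String[] args) {
        // Physical plan assembled operator-by-operator: Limit(Filter(Scan)).
        Operator plan = new Limit(
            new Filter(new Scan(List.of(1, 2, 3, 4, 5, 6)), r -> r % 2 == 0), 2);
        while (plan.hasNext()) System.out.print(plan.next() + " "); // prints: 2 4
    }
}
```

The contrast with the query-block model is that nothing here bundles scanning, filtering, and limiting into one unit: the planner chooses the composition, which is what makes a direct mapping from the Calcite logical plan possible.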
