gianm commented on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1183728238
> The way we achieved these goals in other systems (Drill, Presto, Impala, Hive, etc.) is to use the classic optimizer + operator DAG architecture. The optimizer allows queries to handle a wide variety of use cases (anything that can be expressed in SQL): scans, groupings, joins, windowing, grouping over joins over scans topped by windowing, etc. The optimizer (AKA "planner") handles distribution: whether that be classic Druid scatter-gather or a more general multi-stage plan. The optimizer also handles tasks such as data locality, resource balancing, etc.

@paul-rogers imo, this is absolutely where we want to go. The monolithic native queries will not serve us well in a world where queries and dataflow can get more complex. It seems clear that the ideal structure is, as you say, the standard approach of a tree of fine-grained operators that can be manipulated by the planner and then distributed and run by the execution engine. The main question to me is how we get there.

It's helpful to zoom out and look at the query stack as a whole. Today's query stack has four main pieces.

**First:** SQL-to-native-query translation, which happens in a Calcite-based planner using a collection of Calcite rules that convert logical operators into native queries. They use a "collapsing" strategy, where a tree full of Calcite's fine-grained builtin logical operators is collapsed into monolithic native query nodes one by one. (One rule turns a Scan operator into a Druid query; the next rule collapses that Druid query node with a Filter node into a new Druid query node; etc.)

**Second:** Scatter on the Broker. This is managed mainly by CachingClusteredClient. It scatters the query by creating subqueries for each data server referencing the specific segments it wants those data servers to process.

**Third:** Segment fan-out on the data servers.
Each data server uses `QueryRunnerFactory.createRunner` to create a runner for each segment, and `QueryRunnerFactory.mergeRunners` to create a merged result stream from those runners.

**Fourth:** Gather on the Broker. This is also managed mainly by CachingClusteredClient. It collects sorted streams of results from each data server, then merges them into a single stream using `Query.getResultOrdering` (to do an n-way merge of sorted results) and `QueryToolChest.mergeResults` (to combine rows when appropriate, and do follow-on processing such as projections, having filters, post-aggregation ordering, and limit).

To me the key question is: what here can we keep as-is, what can be evolved, and what should be implemented fresh?

🔥 takes:

- We can't drop support for the current monolithic native queries, which means we need to either retain the execution system that runs them, or be able to convert them to operator trees that can run just as well. It seems to me that the latter (convert native queries to operators, then run the operators) is the right long-term play. But it also seems to me that we'll want to start by having two execution systems. These two systems should share as much code as feasible, so we aren't _literally_ maintaining two completely separate systems. (To enable this code sharing, we can use a strategy like the QueryRunner evolution plan that @paul-rogers has been talking about.)
- It seems to me that to do multi-stage low-latency queries, CachingClusteredClient will need to be replaced: its structure is too tightly coupled to the scatter/gather concept. I don't think its replacement will share much code with the current CachingClusteredClient, because CachingClusteredClient does everything on Sequences of Java objects, and we'll want the new stuff to work on frame channels.
- My guess is the QueryToolChest and QueryRunnerFactory interfaces also don't get carried through to the operator-tree execution system, but a lot of what they do internally does get carried through. Something like: we'll have QueryToolChest and QueryRunnerFactory for the current monolithic queries, some new thing for operator-based queries, and internally they will share a bunch of the same code. I'm not sure how frames fit into this. If they're going to be used for local processing (not just for IPC), then this suggests a broad move away from Sequences.
- We'll want the SQL planner to be able to generate operator trees rather than trees of monolithic native queries. It seems to me that the long-term play is to go to 100% operator trees, but we're likely to start by having the planner be able to generate either one.
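To make the **Fourth** step concrete: the gather on the Broker is at heart an n-way merge of individually-sorted per-server streams, keyed by the query's result ordering. Below is a minimal sketch of that merge in plain Java, using iterators over in-memory lists in place of Druid's Sequences; the `NWayMerge` class and its names are illustrative assumptions, not Druid's actual API, and it buffers the output where the real gather streams.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

public class NWayMerge {
    // N-way merge of individually-sorted streams into one sorted stream,
    // roughly what the Broker's gather step does with per-data-server results.
    public static <T> List<T> merge(List<Iterator<T>> streams, Comparator<T> ordering) {
        // Heap keyed on the current head element of each stream.
        PriorityQueue<Map.Entry<T, Iterator<T>>> heap =
            new PriorityQueue<>(Map.Entry.comparingByKey(ordering));
        for (Iterator<T> stream : streams) {
            if (stream.hasNext()) {
                heap.add(new AbstractMap.SimpleEntry<>(stream.next(), stream));
            }
        }
        List<T> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Map.Entry<T, Iterator<T>> head = heap.poll();
            out.add(head.getKey());  // emit the smallest head element
            Iterator<T> stream = head.getValue();
            if (stream.hasNext()) {  // refill the heap from the same stream
                heap.add(new AbstractMap.SimpleEntry<>(stream.next(), stream));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Three "data servers", each returning a sorted result stream.
        List<Iterator<Integer>> perServer = List.of(
            List.of(1, 4, 7).iterator(),
            List.of(2, 5, 8).iterator(),
            List.of(3, 6, 9).iterator());
        System.out.println(merge(perServer, Comparator.naturalOrder()));
    }
}
```

The heap holds one entry per stream, so each emitted row costs O(log k) for k servers, which is why the gather step depends on the per-server streams already being sorted by `Query.getResultOrdering`.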
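And to make "a tree of fine-grained operators" concrete: here is a toy sketch of the shape such a tree might take. The `Operator` interface and the `scan`/`filter`/`limit` helpers are hypothetical names for illustration, not Druid's actual interfaces; the point is that each relational step is its own composable node that a planner can manipulate, rather than a field inside one monolithic query object.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

public class OperatorTree {
    // A fine-grained operator yields a row stream when opened; parents pull from children.
    interface Operator { Iterator<Integer> open(); }

    // Leaf operator: scans an in-memory "segment" of rows.
    static Operator scan(List<Integer> segment) {
        return segment::iterator;
    }

    // Filter operator: wraps a child and drops rows failing the predicate.
    // (Buffers for brevity; a real operator would stream row by row.)
    static Operator filter(Operator child, Predicate<Integer> pred) {
        return () -> {
            List<Integer> kept = new ArrayList<>();
            child.open().forEachRemaining(row -> { if (pred.test(row)) kept.add(row); });
            return kept.iterator();
        };
    }

    // Limit operator: stops pulling from the child after n rows.
    static Operator limit(Operator child, int n) {
        return () -> {
            Iterator<Integer> in = child.open();
            List<Integer> kept = new ArrayList<>();
            while (in.hasNext() && kept.size() < n) {
                kept.add(in.next());
            }
            return kept.iterator();
        };
    }

    public static void main(String[] args) {
        // A planner would assemble (and rearrange) trees like this one:
        // Limit(2) <- Filter(even) <- Scan(segment)
        Operator plan = limit(filter(scan(List.of(1, 2, 3, 4, 5, 6)), row -> row % 2 == 0), 2);
        List<Integer> result = new ArrayList<>();
        plan.open().forEachRemaining(result::add);
        System.out.println(result);
    }
}
```

In this shape, "convert native to operators" means translating each part of a monolithic query (its filter, its limit, and so on) into one of these nodes, after which the planner is free to split the tree across stages or servers.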