paul-rogers commented on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1184702375
> To me the key question is: what here can we keep as-is, what can be evolved, and what should be implemented fresh?

@gianm, thanks for the breakdown of areas! To answer the above question, I'd ask a different one: what are we trying to solve? We only need to change that part.

To answer that, let's do a thought experiment. Suppose we are Pinot, and we use Presto to do non-trivial queries. What would that look like? We'd give Presto enough information that Presto could distribute segment reads to its workers. Each worker would reach out to a historical with a native query. Presto would try to push as much work as possible into the native query so that Presto receives the minimal amount of data from each historical. That requires work in the Presto planner to work out distribution (which historicals have which segments, how to distribute that work to Presto nodes, and what can be pushed down to the historicals). The Presto leaf operators would send native queries to historicals to do the "scan" work. From there, Presto would implement the rest of the query: grouping, joins, windowing, and so on.

Apache Drill (and I think Presto?) have Druid connectors, but the Drill one uses JDBC, which single-threads the reads through the Broker. It would be cool to add the above logic to the Druid connectors for Drill and Presto so we could see how the leading query tools would handle Druid queries. Spark could do the same, but for batch operations.

The Druid solution, of course, will do it all, but we can use the same approach. The planner works out the distribution model and distributes work to historicals. Rather than having the Broker work with historicals directly, we can have a tier of worker nodes do that work: aggregate or merge results, shuffle to other nodes, etc. We'd do that new work if we want to support queries beyond scatter/gather. (Simple queries reduce to scatter/gather: the Broker does the work when it is a simple merge.)
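To make the pushdown idea concrete, here is a minimal Python sketch of the leaf-planning step: group segments by the historical that serves them, and emit one native scan query per historical so that filtering happens on the data nodes. This is illustrative only, not Druid's actual planner code; `plan_leaf_queries`, the segment-location map, and the `context.segments` hint are hypothetical, though `queryType`, `dataSource`, `intervals`, and `filter` are real fields of a native scan query.

```python
from collections import defaultdict

def plan_leaf_queries(datasource, interval, filter_spec, segment_locations):
    """Build one native scan query per historical.

    segment_locations: dict of segment_id -> "host:port" of the historical
    serving that segment (in practice this would come from cluster metadata).
    """
    by_historical = defaultdict(list)
    for segment_id, host in segment_locations.items():
        by_historical[host].append(segment_id)

    queries = {}
    for host, segment_ids in by_historical.items():
        queries[host] = {
            "queryType": "scan",
            "dataSource": datasource,
            "intervals": [interval],
            "filter": filter_spec,  # pushed down: the historical filters rows
            # Hypothetical hint restricting the query to this node's segments.
            "context": {"segments": sorted(segment_ids)},
        }
    return queries

plan = plan_leaf_queries(
    "wikipedia",
    "2016-06-27/2016-06-28",
    {"type": "selector", "dimension": "channel", "value": "#en.wikipedia"},
    {"seg-1": "hist-a:8083", "seg-2": "hist-b:8083", "seg-3": "hist-a:8083"},
)
```

With this shape, the distributed engine (whether Presto, Drill, or a new Druid tier) only has to merge what comes back; the expensive row-level work stays on the historicals.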
The key point here is that, even with a DAG, ultimately we want to push as much work as possible onto the historicals: filtering, per-segment aggregation, etc. We already have native query code that does that. So, expediency says we just create a physical plan whose leaf nodes hold the per-historical (or per-segment) native query, which we then ship to the historical (using REST or some fancy new networking solution).

This means that the part we'd want to change is the distributed portion of the query engine, not the part that runs on data nodes. Since the query runner/sequence structure isn't a good fit for a distributed DAG, we can instead refactor that code into an optimizer and operators, with a distribution framework. Crucially, the data nodes need not change: they don't need to care whether they are talking to the "classic" scatter/gather model, Presto, Drill, Spark, or the new multi-tier engine that happens to use an optimizer and operator DAG.

The above was validated in the operator prototype mentioned earlier: one variation refactored the Broker to be operator based and sent native queries to the historicals as a form of pushing operations into the input source, just as we do in Drill.
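The "leaf nodes hold a native query" structure can be sketched as a tiny operator protocol. This is an illustrative Python sketch with hypothetical names (`Operator`, `NativeQueryScan`, `Merge`), not the prototype's actual classes; the point is that the leaf operator is the only piece that knows about historicals, so the data nodes see an ordinary native query regardless of which engine is driving them.

```python
class Operator:
    """Minimal open/next/close protocol for a distributed query DAG."""
    def open(self): ...
    def next(self): ...   # returns a batch of rows, or None when exhausted
    def close(self): ...

class NativeQueryScan(Operator):
    """Leaf: ships a native query to one historical and streams its results."""
    def __init__(self, historical, native_query, send):
        self.historical = historical
        self.native_query = native_query
        self.send = send          # injected transport (REST or otherwise)
        self.batches = None

    def open(self):
        self.batches = iter(self.send(self.historical, self.native_query))

    def next(self):
        return next(self.batches, None)

    def close(self):
        self.batches = None

class Merge(Operator):
    """Interior node: drains its children (simple concatenation here;
    a real engine would do ordered merges, grouping, shuffles, etc.)."""
    def __init__(self, children):
        self.children = children

    def open(self):
        for c in self.children:
            c.open()

    def next(self):
        for c in self.children:
            batch = c.next()
            if batch is not None:
                return batch
        return None

    def close(self):
        for c in self.children:
            c.close()

# Fake transport for illustration: pretend each historical returns one batch.
fake_results = {
    "hist-a:8083": [[{"channel": "#en", "rows": 10}]],
    "hist-b:8083": [[{"channel": "#en", "rows": 7}]],
}
send = lambda host, query: fake_results[host]

dag = Merge([NativeQueryScan(h, {"queryType": "scan"}, send)
             for h in ("hist-a:8083", "hist-b:8083")])
dag.open()
batches = []
while (b := dag.next()) is not None:
    batches.append(b)
dag.close()
```

Swapping `send` for a real REST client is the only change needed to run against live historicals, which is exactly why the data nodes don't have to care who is calling them.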
