imply-cheddar commented on PR #14232:
URL: https://github.com/apache/druid/pull/14232#issuecomment-1596495350

   There have been requests for a design doc.  The description of this PR attempts to lay out the explanation, though it is very terse and isn't really consumable by anybody who wasn't involved in the work.
   
   The basics of this PR are:
   
   1) It introduces a new way of planning SQL.  This is added as yet another method of planning and does not impact the current methods.
   2) You can opt in to the new method of planning by setting a context parameter.  It is currently very verbose in its logging, and there are queries that are actually plannable which the new method does not yet know how to plan.
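   As a minimal sketch of what the opt-in could look like: the comment does not name the actual context parameter, so the key and value below (`plannerStrategy`, `"DECOUPLED"`) are hypothetical placeholders, not the PR's real API.

```java
import java.util.Map;

public class ContextOptIn {
    // Hypothetical context key; the real parameter name is defined by the PR.
    static final String PLANNING_MODE_KEY = "plannerStrategy";

    // Fall back to the existing planner unless the caller explicitly opts in.
    static boolean useDecoupledPlanning(Map<String, Object> queryContext) {
        return "DECOUPLED".equals(queryContext.get(PLANNING_MODE_KEY));
    }

    public static void main(String[] args) {
        System.out.println(useDecoupledPlanning(Map.of()));
        System.out.println(useDecoupledPlanning(Map.of(PLANNING_MODE_KEY, "DECOUPLED")));
    }
}
```

   Keeping the new path behind an explicit opt-in is what lets the work land without touching current behavior.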
   
   The new method "decouples" planning of native queries into a 2-phase process
   
   1. SQL -> DAG
   2. DAG -> native query
   
   The DAG is returned from the SQL planner and then that DAG is used to build 
the native query.
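   The two-phase flow above can be sketched as follows.  All class and method names here are illustrative stand-ins, not the PR's actual API; the point is simply that each phase produces a distinct, inspectable object.

```java
import java.util.List;

public class TwoPhasePlanning {
    // Phase 1 output: a logical DAG, independent of any native query form.
    record LogicalDag(List<String> nodes) {}

    // Phase 2 output: the native query built from the DAG.
    record NativeQuery(String description) {}

    // Phase 1: SQL -> DAG.  This succeeds for any valid SQL, even if no
    // native query can ultimately be produced for it.
    static LogicalDag planSqlToDag(String sql) {
        return new LogicalDag(List.of("scan", "filter", "aggregate"));
    }

    // Phase 2: DAG -> native query.  This is where "Druid cannot run this
    // shape" errors would be raised, with the DAG available for inspection.
    static NativeQuery dagToNativeQuery(LogicalDag dag) {
        return new NativeQuery("native query over " + dag.nodes());
    }

    public static void main(String[] args) {
        LogicalDag dag = planSqlToDag("SELECT ...");  // phase 1
        NativeQuery q = dagToNativeQuery(dag);        // phase 2
        System.out.println(q.description());
    }
}
```

   Because the DAG is a first-class intermediate object, you can look at it directly instead of breakpointing inside the volcano planner.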
   
   The reason that we are doing this is to make it easier to iterate and work 
with SQL for Druid.  Druid's native query planning follows a relatively 
uncommon pattern of building up the physical execution plan as the output of 
the volcano planner.  On the plus side, this was done so that we could leverage 
the volcano planner to explore different methods of physical planning in case 
something wasn't natively plannable.  On the negative side, this makes it 
incredibly difficult to actually work with the SQL planning as it exists in 
Druid.  This is seen through the following symptoms:
   
   1) There is a class of errors for which it is nigh-impossible to generate good error messages.  An error could mean "volcano planner, explore something else" or it could mean "user, you did something wrong, rewrite your query".  The fundamental source of these poor error messages is that we have coupled the physical execution conversion with the planning.  The current code tries to work around this by allowing you to set a "possible error", which the planner will surface.  In practice, this means we show users the last error message that caused the volcano planner to stop, which might or might not be related to the real issue; a lot of the time, it is not related at all.  Instead of working around the root cause (the coupling of the planning with the building of the execution), we should decouple things.
   2) When a query fails to plan but we think it should be plannable, it is extremely difficult to figure out whether the issue is with the rules (i.e. we need a rule that orders nodes differently) or with the physical query planning (i.e. something in `PartialDruidQuery` and peripheral classes needs updating).  The "normal" debugging process would be: (1) is the SQL turning into a "good" DAG?  If yes, figure out why that "good" DAG is not being converted; if no, figure out why the SQL is not turning into a "good" DAG.  This is, in fact, the process we normally follow; we just have to do it by setting a breakpoint in the volcano planner code and inspecting all of the various plans it explores to figure out whether there is a good one in there.  Instead of needing to fire up a debugger to do that first-degree split, we should be able to do it much more directly.  That is, the normal flow of the code could be SQL -> DAG -> native query, with an object representing each of these, which is exactly what the decoupling accomplishes.
   3) It is very difficult to identify and understand which patterns of logical DAG are not runnable as native Druid queries.  Given that the current code leverages "I cannot run that" as a way of telling the planner to explore other things, it is not abundantly clear which combinations of DAG elements lead to a failure that can be planned successfully by rearranging things versus which combinations will never be convertible.  This makes it quite difficult to inventory what Druid's native queries can and cannot do when it comes to running SQL.  The decoupling eliminates this, because the first pass of SQL -> DAG should succeed regardless of whether a Druid native query can be generated.  When the generated DAG is converted to a native query, the conversion code will be able to identify that a specific set of DAG elements is not convertible and generate an error that explains why.  So, what happens when we get an error saying we cannot plan a query that we think should be plannable?  We follow the normal debugging procedure: evaluate the generated DAG and convince ourselves that it is a "good" DAG for the input SQL.  If it is a "bad" DAG, we figure out which rules will help us generate a "good" one; if it is a "good" DAG, we figure out how to change the conversion logic so that it stops generating an exception.
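   To make the error-message point from items 1 and 3 concrete, here is a toy sketch of what the DAG -> native conversion step enables.  The converter either succeeds or fails with a precise, user-facing reason naming the offending DAG element, rather than surfacing whatever "possible error" last stopped the volcano planner.  The node names and the set-membership check are purely illustrative; real convertibility checks would be structural.

```java
import java.util.List;
import java.util.Set;

public class DagConversion {
    // Hypothetical inventory of DAG node kinds the native converter
    // knows how to handle (illustrative only).
    static final Set<String> CONVERTIBLE = Set.of("scan", "filter", "aggregate", "sort");

    static String toNativeQuery(List<String> dagNodes) {
        for (String node : dagNodes) {
            if (!CONVERTIBLE.contains(node)) {
                // A descriptive, user-facing error: we know exactly which
                // DAG element blocked conversion, not just "cannot plan".
                throw new IllegalArgumentException(
                    "Cannot convert DAG node [" + node + "] to a native query");
            }
        }
        return "native(" + String.join(" -> ", dagNodes) + ")";
    }

    public static void main(String[] args) {
        System.out.println(toNativeQuery(List.of("scan", "filter")));
        try {
            toNativeQuery(List.of("scan", "windowed-join"));
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

   An exhaustive converter like this also doubles as the inventory of what native queries can and cannot do, which is exactly the visibility item 3 asks for.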
   
   The change itself does not impact current behavior and is being done as an add-on.  Once it passes all of the existing tests, we will be able to switch the default planning mode over to it and deprecate the old native query planning.  Given that this isn't actually changing behavior and isn't fundamentally changing any public API, I believe it should be safe to iterate on this work quite rapidly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

