Hi Sergey!

Thanks for the very informative email!

On Wed, Aug 5, 2009 at 6:58 AM, Sergey Petrunya<[email protected]> wrote:
> Hi!
>
> On Tue, Aug 04, 2009 at 03:29:34AM -0400, Padraig O'Sullivan wrote:
>> So I was looking at the optimizer code a little bit tonight and came
>> across something I'm not sure about. Basically, I'm wondering why is
>> there a need for the POSITION struct in the optimizer?
>>
>> >From what I can understand, an array of POSITION objects is used to
>> represent a query plan during optimization and once an optimal query
>> plan is chosen, this array of POSITION objects is copied into an array
>> of JoinTable objects (in get_best_combination()). Then during
>> execution time, the array of JoinTable objects represents the query
>> execution plan.
>>
>> What I'm wondering is why not just use an array of JoinTable objects
>> during both optimization and execution? If you look at the POSITION
>> structure, it actually contains a pointer to a JoinTable object with
>> just a few additional members. In theory, could we just extend
>> JoinTable with the members from POSITION and remove the POSITION
>> structure?
>>
>> Is there something I am missing here (very likely as I'm just starting
>> to look at the optimizer)? Is there some specific reason things are
>> designed like this?
> I normally work on MySQL and MariaDB (and not Drizzle), so there may be
> certain differences between those and drizzle, but overall it is as follows
> (I'll be using MySQL/MariaDB variable names, sorry):
>
> Current state
> -------------
> JOIN_TAB (aka JoinTable) represents a table in a join. It is actually a mix
> of two kinds of members:
>
> J1. A part that makes sense before/without join order. For example:
>  KEYUSE *keyuse - *all* possible ways one could construct 'ref' access.
>  Different ref accesses have different requirements on what the join order
>  must look like.
>
>  QUICK_SELECT_I *quick - the best way to read the table using range or
>  index_merge access (doesnt make any assumptions about join order)
>
>  ... and so forth.
>
> J2. A part that makes sense only after you've picked join order. Those are
>  set in get_best_combination() and after that (e.g. in make_join_readinfo()):
>
>  enum join_type type;
>  TABLE_REF ref;
>  JOIN_CACHE cache;
>  ...and so forth.
>
> There is a 1<-> mapping between TABLE* structures and JOIN_TABs (reachable 
> both
> ways via join_tab->table and table->reginfo.join_tab)
>
> POSITION, on the other hand, is a position of table within a join order. It 
> has
> only members that are being considered when doing join optimization.
>
> There are two POSITION arrays:
>  - join->positions[] holds the prefix of the join order that we're currently
>   considering.
>
>  - join->best_positions[] holds description of the best picked join order so
>   far. After join optimization has finished, it holds the picked join order.
>
> POSITIONs do not have members like "JOIN_CACHE cache" because decision about
> whether to use a join cache is made after we've picked the join order.
>
> Discussion
> ----------
> IMO current scheme has some drawbacks:
>  - JOIN_TAB has two kinds of members, and they are not clearly delimited.
>  - Final execution plan is spread across two arrays of structures -
>   join->join_tab[i] and join->best_positions[].
>
> One benefit is that there is few data duplication.
>
> With regards to the suggestion to merge POSITIONs into JOIN_TABs and then
> operate only on JOIN_TABs everywhere:
>
> - join optimizer will have to work on a bigger set (I have 
> sizeof(JOIN_TAB)=392,
>  sizeof(POSITION)=32. I'm not sure whether that will have a noticeable impact.
>
> - the mess that comes from mixing J1 and J2 in the same struct will be
>  exacerbated, because join optimizer code will start to modify JOIN_TAB
>  structures instead of POSITION structures.
>
> - the benefits are not apparent to me. There will be one less struct. So what,
>  structs are free for compiler, and their use adds more, well, structure.

I did not have any benefit in mind when I asked originally; I was just
curious :)

>
> As an alternative, I'd suggest:
>
> - Separate J1 and J2 parts into two structures:
>  1. J1 is for various pre-join optimization attributes
>  2. J2 is exclusively for picked join order.
>
> - Make get_best_combination() (or other similar function) copy all needed
>  information from join->best_positions[..] to array of J2 data structures
>  and make nobody touches best_positions.
>  That way, we'll have that we have the join plan encoded in one array (and not
>  two), and a clearer separation between transient join optimization data and
>  the result of join optimization.
>
> Any thoughts?

This is a nice idea. We definitely plan on having a class in drizzle
for representing the final query execution plan (probably called
Plan!). It seems like this class would encapsulate the array
(hopefully std::vector in drizzle :) of J2 structure's you mentioned.

We are just starting to look at a lot of the optimizer code and so are
a good bit away from this. Our first step is just trying to make the
code more object-orientated and documenting ideas on tasks to be
performed in this area. As an example of the tasks we are doing in
this area right now, we just made the best_positions and positions
arrays private members of the JOIN class and provided accessors for
these members. Besides these small tasks, I'm just reading through the
optimizer code right now.

Thanks again for your input! I am sure there will be lots more
questions from me in this area in the future!

-Padraig

>
> BR
>  Sergey
> --
> Sergey Petrunia, Software Developer
> Monty Program AB, http://askmonty.org
> Blog: http://s.petrunia.net/blog
>

_______________________________________________
Mailing list: https://launchpad.net/~drizzle-discuss
Post to     : [email protected]
Unsubscribe : https://launchpad.net/~drizzle-discuss
More help   : https://help.launchpad.net/ListHelp

Reply via email to