Hi Hanu,
In the plan you provided (simplified version below)
DrillScreenRel
DrillProjectRel(**=[$0], **0=[$3])
DrillSemiJoinRel(condition=[=($1, $6)], joinType=[inner])
DrillJoinRel(condition=[=($2, $5)], joinType=[inner])
DrillScanRel(table=[[dfs, /home/mapr/data/sf1/parquet/web_sales]])
DrillScanRel(table=[[dfs, /home/mapr/data/sf1/parquet/web_sales]])
DrillScanRel(table=[[dfs,/home/mapr/data/sf1/parquet/date_dim]])
it seems valid to change the order of the semi-join with the join. However,
I don't think the current code is able to perform this re-ordering.
If you want to work on this the best place to start would be
JoinToMultiJoinRule and LoptOptimizeJoinRule.
Best,
Stamatis
Στις Τρί, 15 Ιαν 2019 στις 9:01 μ.μ., ο/η hanu mapr <[email protected]>
έγραψε:
> Thank you Julian and Stamatis for the clarification. While debugging I
> observed that in MultiJoin operator there were only two tables
> participating in the join ordering for the below query.
>
> Here is the query which I could see a difference in the plan (in Drill).
>
> 0: jdbc:drill:zk=local> alter session set `planner.enable_semijoin` = true;
> +-------+-----------------------------------+
> | ok | summary |
> +-------+-----------------------------------+
> | true | planner.enable_semijoin updated. |
> +-------+-----------------------------------+
> 1 row selected (0.073 seconds)
> 0: jdbc:drill:zk=local> select * from
> dfs.`/home/mapr/data/sf1/parquet/web_sales` ws1,
> dfs.`/home/mapr/data/sf1/parquet/web_sales` ws2 where
> ws1.ws_ship_date_sk in (select dd.d_date_sk from
> dfs.`/home/mapr/data/sf1/parquet/date_dim` dd) and ws1.ws_order_number
> = ws2.ws_order_number;
>
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | text
> | json
> |
>
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | DrillScreenRel
> DrillProjectRel(**=[$0], **0=[$3])
> DrillSemiJoinRel(condition=[=($1, $6)], joinType=[inner])
> DrillJoinRel(condition=[=($2, $5)], joinType=[inner])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/web_sales]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`**`,
> `ws_ship_date_sk`, `ws_order_number`]]])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/web_sales]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`**`,
> `ws_ship_date_sk`, `ws_order_number`]]])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/date_dim]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/date_dim]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/date_dim, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`d_date_sk`]]])
>
> 0: jdbc:drill:zk=local> alter session set `planner.enable_semijoin` =
> false;
> +-------+-----------------------------------+
> | ok | summary |
> +-------+-----------------------------------+
> | true | planner.enable_semijoin updated. |
> +-------+-----------------------------------+
> 1 row selected (0.077 seconds)
> 0: jdbc:drill:zk=local> select * from
> dfs.`/home/mapr/data/sf1/parquet/web_sales` ws1,
> dfs.`/home/mapr/data/sf1/parquet/web_sales` ws2 where
> ws1.ws_ship_date_sk in (select dd.d_date_sk from
> dfs.`/home/mapr/data/sf1/parquet/date_dim` dd) and ws1.ws_order_number
> = ws2.ws_order_number;
>
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | text
> | json
> |
>
> +----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
> | DrillScreenRel
> DrillProjectRel(**=[$0], **0=[$3])
> DrillProjectRel(**=[$0], ws_ship_date_sk=[$1],
> ws_order_number=[$2], **0=[$4], ws_ship_date_sk0=[$5],
> ws_order_number0=[$6], d_date_sk=[$3])
> DrillJoinRel(condition=[=($2, $6)], joinType=[inner])
> DrillJoinRel(condition=[=($1, $3)], joinType=[inner])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/web_sales]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`**`,
> `ws_ship_date_sk`, `ws_order_number`]]])
> DrillAggregateRel(group=[{0}])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/date_dim]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/date_dim]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/date_dim, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`d_date_sk`]]])
> DrillScanRel(table=[[dfs,
> /home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
> [entries=[ReadEntryWithPath
> [path=file:/home/mapr/data/sf1/parquet/web_sales]],
> selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
> numRowGroups=1, usedMetadataFile=false, columns=[`**`,
> `ws_ship_date_sk`, `ws_order_number`]]])
>
>
> I was actually looking for a plan with DrillSemiJoinRel pushed down when
> the semi join is enabled.
> Thank you for sharing the paper.
>
> Regards,
> -Hanu
>
>
> On Mon, Jan 14, 2019 at 5:18 AM Stamatis Zampetakis <[email protected]>
> wrote:
>
> > @Hanu: Can you clarify a bit more where is the problem?
> > From a quick loon in LoptOptimizeJoinRule it seems that all joins to be
> > re-ordered are grouped using a MultiJoin operator.
> > Have you checked that the SemiJoin in question is part of the MultiJoin
> > operator?
> >
> > In general semi joins are more difficult to re-order with other joins
> since
> > they do not have neither the commutative nor the associative property.
> > There are other properties that allow the re-ordering of semi-joins but I
> > am not aware if they are exploited by LoptOptimizeJoinRule. If you plan
> > to work on this you may find interesting the work by Moerkotte et. al. on
> > join enumeration [1].
> >
> > Best,
> > Stamatis
> >
> > [1] On the Correct and Complete Enumeration of the CoreSearch Space
> > <
> >
> https://www.researchgate.net/profile/Guido_Moerkotte/publication/262216932_On_the_correct_and_complete_enumeration_of_the_core_search_space/links/58467f2c08ae61f75ddb205c/On-the-correct-and-complete-enumeration-of-the-core-search-space.pdf
> > >
> >
> >
> > Στις Δευ, 14 Ιαν 2019 στις 7:31 π.μ., ο/η Julian Hyde <[email protected]>
> > έγραψε:
> >
> > > I don't see any reason in principle why LoptOptimizeJoinRule could not
> > > handle semi-joins. Of course it will have to remember that these are
> > > semi-joins so that it can re-create them as semi-joins afterwards. And
> > > there may be complications because a semi-join eliminates duplicates
> > > in its right-hand input (or rather, the number of rows is not
> > > multiplied if a key has multiple values) and therefore moving a
> > > semi-join up and down the join graph may not affect row-counts in the
> > > expected way.
> > >
> > > It's certainly beneficial to push semi-joins down in many cases.
> > >
> > > Julian
> > >
> > >
> > > On Fri, Jan 11, 2019 at 8:41 PM hanu mapr <[email protected]> wrote:
> > > >
> > > > Hello all,
> > > >
> > > > I am working on an issue wherein one of the tpcds query is regressing
> > > when
> > > > semijoin is also present with other inner joins in the query.
> > Currently,
> > > > DRILL uses LoptOptimizeJoinRule for join order optimization. During
> the
> > > > debugging process, I observed that LoptOptimizeJoinRule is not
> > > considering
> > > > semi joins for join order optimization.
> > > > Please do let me know if my understanding is correct. If so what is
> the
> > > > best option for handling this scenario?.
> > > >
> > > > Thanks in advance.
> > > >
> > > > Regards,
> > > > -Hanu
> > >
> >
>