Thank you Julian and Stamatis for the clarification. While debugging I
observed that in MultiJoin operator there were only two tables
participating in the join ordering for the below query.
Here is the query which I could see a difference in the plan (in Drill).
0: jdbc:drill:zk=local> alter session set `planner.enable_semijoin` = true;
+-------+-----------------------------------+
| ok | summary |
+-------+-----------------------------------+
| true | planner.enable_semijoin updated. |
+-------+-----------------------------------+
1 row selected (0.073 seconds)
0: jdbc:drill:zk=local> select * from
dfs.`/home/mapr/data/sf1/parquet/web_sales` ws1,
dfs.`/home/mapr/data/sf1/parquet/web_sales` ws2 where
ws1.ws_ship_date_sk in (select dd.d_date_sk from
dfs.`/home/mapr/data/sf1/parquet/date_dim` dd) and ws1.ws_order_number
= ws2.ws_order_number;
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| text
| json
|
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| DrillScreenRel
DrillProjectRel(**=[$0], **0=[$3])
DrillSemiJoinRel(condition=[=($1, $6)], joinType=[inner])
DrillJoinRel(condition=[=($2, $5)], joinType=[inner])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/web_sales]],
selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`**`,
`ws_ship_date_sk`, `ws_order_number`]]])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/web_sales]],
selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`**`,
`ws_ship_date_sk`, `ws_order_number`]]])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/date_dim]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/date_dim]],
selectionRoot=file:/home/mapr/data/sf1/parquet/date_dim, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`d_date_sk`]]])
0: jdbc:drill:zk=local> alter session set `planner.enable_semijoin` = false;
+-------+-----------------------------------+
| ok | summary |
+-------+-----------------------------------+
| true | planner.enable_semijoin updated. |
+-------+-----------------------------------+
1 row selected (0.077 seconds)
0: jdbc:drill:zk=local> select * from
dfs.`/home/mapr/data/sf1/parquet/web_sales` ws1,
dfs.`/home/mapr/data/sf1/parquet/web_sales` ws2 where
ws1.ws_ship_date_sk in (select dd.d_date_sk from
dfs.`/home/mapr/data/sf1/parquet/date_dim` dd) and ws1.ws_order_number
= ws2.ws_order_number;
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| text
| json
|
+----------------------------------------------------------------------------------+----------------------------------------------------------------------------------+
| DrillScreenRel
DrillProjectRel(**=[$0], **0=[$3])
DrillProjectRel(**=[$0], ws_ship_date_sk=[$1],
ws_order_number=[$2], **0=[$4], ws_ship_date_sk0=[$5],
ws_order_number0=[$6], d_date_sk=[$3])
DrillJoinRel(condition=[=($2, $6)], joinType=[inner])
DrillJoinRel(condition=[=($1, $3)], joinType=[inner])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/web_sales]],
selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`**`,
`ws_ship_date_sk`, `ws_order_number`]]])
DrillAggregateRel(group=[{0}])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/date_dim]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/date_dim]],
selectionRoot=file:/home/mapr/data/sf1/parquet/date_dim, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`d_date_sk`]]])
DrillScanRel(table=[[dfs,
/home/mapr/data/sf1/parquet/web_sales]], groupscan=[ParquetGroupScan
[entries=[ReadEntryWithPath
[path=file:/home/mapr/data/sf1/parquet/web_sales]],
selectionRoot=file:/home/mapr/data/sf1/parquet/web_sales, numFiles=1,
numRowGroups=1, usedMetadataFile=false, columns=[`**`,
`ws_ship_date_sk`, `ws_order_number`]]])
I was actually looking for a plan with DrillSemiJoinRel pushed down when
the semi join is enabled.
Thank you for sharing the paper.
Regards,
-Hanu
On Mon, Jan 14, 2019 at 5:18 AM Stamatis Zampetakis <[email protected]>
wrote:
> @Hanu: Can you clarify a bit more where is the problem?
> From a quick loon in LoptOptimizeJoinRule it seems that all joins to be
> re-ordered are grouped using a MultiJoin operator.
> Have you checked that the SemiJoin in question is part of the MultiJoin
> operator?
>
> In general semi joins are more difficult to re-order with other joins since
> they do not have neither the commutative nor the associative property.
> There are other properties that allow the re-ordering of semi-joins but I
> am not aware if they are exploited by LoptOptimizeJoinRule. If you plan
> to work on this you may find interesting the work by Moerkotte et. al. on
> join enumeration [1].
>
> Best,
> Stamatis
>
> [1] On the Correct and Complete Enumeration of the CoreSearch Space
> <
> https://www.researchgate.net/profile/Guido_Moerkotte/publication/262216932_On_the_correct_and_complete_enumeration_of_the_core_search_space/links/58467f2c08ae61f75ddb205c/On-the-correct-and-complete-enumeration-of-the-core-search-space.pdf
> >
>
>
> Στις Δευ, 14 Ιαν 2019 στις 7:31 π.μ., ο/η Julian Hyde <[email protected]>
> έγραψε:
>
> > I don't see any reason in principle why LoptOptimizeJoinRule could not
> > handle semi-joins. Of course it will have to remember that these are
> > semi-joins so that it can re-create them as semi-joins afterwards. And
> > there may be complications because a semi-join eliminates duplicates
> > in its right-hand input (or rather, the number of rows is not
> > multiplied if a key has multiple values) and therefore moving a
> > semi-join up and down the join graph may not affect row-counts in the
> > expected way.
> >
> > It's certainly beneficial to push semi-joins down in many cases.
> >
> > Julian
> >
> >
> > On Fri, Jan 11, 2019 at 8:41 PM hanu mapr <[email protected]> wrote:
> > >
> > > Hello all,
> > >
> > > I am working on an issue wherein one of the tpcds query is regressing
> > when
> > > semijoin is also present with other inner joins in the query.
> Currently,
> > > DRILL uses LoptOptimizeJoinRule for join order optimization. During the
> > > debugging process, I observed that LoptOptimizeJoinRule is not
> > considering
> > > semi joins for join order optimization.
> > > Please do let me know if my understanding is correct. If so what is the
> > > best option for handling this scenario?.
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > -Hanu
> >
>