[ http://issues.apache.org/jira/browse/DERBY-754?page=all ]

A B updated DERBY-754:
----------------------

    Attachment: d754_draft.patch

I'm attaching my first attempt at a patch for this issue.  The patch is called 
"d754_draft.patch", and is labelled as "draft" because I still need to do more 
testing and then add those tests to derbyall before the patch is ready to 
commit.

However, I thought I'd post the patch as it stands right now in order to get 
feedback.  As Jeff Lichtman wrote in another thread: "BTW, the business of 
pushing and pulling predicates during optimization can be hard to understand 
and debug"--and since that's exactly what I'm trying to do, I have a feeling 
this patch will go through several iterations before it's determined to be 
correct and complete.  There's still a lot of testing that I need to do, but I 
want to see if I'm at least headed in the right direction.

What I've done with this patch is attempt to push join predicates down into 
subqueries (namely, SELECT queries) at optimization time, but I only attempt to 
do so IF the join node from which the predicate originated is flattenable.  
Inner joins are by default flattenable, though they can become non-flattenable 
if they are beneath a non-flattenable outer join node that is in turn directly 
beneath a SELECT node.  Outer joins are by default NON-flattenable, though they 
can become flattenable if they can be transformed into an equivalent inner join.

Note that in most cases flattenable joins will in fact be flattened into outer 
SELECT queries, in which case pushing the join predicates down is no longer 
useful, and thus we don't even attempt to do so.  However, there are certain 
(relatively rare) cases where flattenable joins are NOT flattened into outer 
queries--and it's for those cases that my patch attempts to push join 
predicates down to subqueries.  This is an admittedly small scope for my patch, 
but I'm hoping that it can serve as an initial basis for my own understanding 
of how predicates are used in optimization.  Once this patch is finalized, 
improvements/expansions on this type of functionality can hopefully be 
implemented in follow-up submissions.

All of that said, a contrived example of a query that _does_ benefit from this 
work is as follows:

create table t1 ("I" INTEGER, "A" INTEGER);
create table t2 ("J" INTEGER, "B" INTEGER);
create table t3 ("K" INTEGER, "C" INTEGER);
create table t4 ("M" INTEGER NOT NULL, "D" INTEGER);

insert into t1 values (1, -1), (2, -2), (4, -4);
insert into t2 values (3, -3), (4, -4), (5, -5);
insert into t3 values (4, -4), (5, -5), (6, -6), (7, -7), (99, -99);
insert into t4 values (2, -2), (3, -3), (4, -4), (5, -5);

-- This query benefits from pushing join predicates.
select
   t1.i, t4.d, x1.c
from
   t1
   inner join
    (select * from t2, t3) x1
   on x1.j = t1.i and x1.k = t1.i
   left outer join
     t4 as x3
   on x3.m = x1.k
   inner join
     t4
   on t4.d < 0 and x1.k = 4
;

Without the patch, the optimizer will choose an access plan that consists of 4 
table scans, 1 hash scan, 3 nested loop joins, and 1 hash left outer join.  
With the patch, 2 of those table scans become hash scans, and one of the nested 
loop joins becomes a hash join--all thanks to predicate pushing.  With a data 
set as small as this one, the difference in performance is hardly noticeable; 
however, I have other much larger databases for which the use of hash scans and 
hash joins instead of table scans and nested loop joins improves performance by 
as much as 60 times (13 minutes down to 13ish seconds).  Thus I do think this 
change is a useful one for Derby.

I have taken the time to add what I believe to be extensive comments to my 
patch, so I won't go into any more detail here.  I'll leave it up to any 
reviewers out there to look at the patch and ask questions or raise concerns if 
they don't understand what I've done or if they notice something that needs 
correcting.

I ran derbyall with this patch last night and everything passed; I then made 
some "clean-up" changes this morning and ran derbylang, again with no failures. 
 However, one thing I've learned in doing this work is that there are all kinds 
of subtleties that are easily overlooked, so even if the nightlies are passing, 
I would still greatly appreciate the time/feedback anyone out there can 
contribute...

In the meantime, I will be working to add tests for this new functionality to 
derbyall, and will post another version of the patch when I have those tests 
ready.

M      java\engine\org\apache\derby\impl\sql\compile\Predicate.java
M      java\engine\org\apache\derby\impl\sql\compile\SelectNode.java
M      java\engine\org\apache\derby\impl\sql\compile\ResultColumn.java
M      java\engine\org\apache\derby\impl\sql\compile\ProjectRestrictNode.java
M      java\engine\org\apache\derby\impl\sql\compile\JoinNode.java
A      
java\engine\org\apache\derby\impl\sql\compile\BaseTableNumbersVisitor.java
M      java\engine\org\apache\derby\impl\sql\compile\OptimizerImpl.java
M      java\engine\org\apache\derby\impl\sql\compile\PredicateList.java
M      java\engine\org\apache\derby\impl\sql\compile\ColumnReference.java

> Push ON clause predicates down when optimizing SELECT queries.
> --------------------------------------------------------------
>
>          Key: DERBY-754
>          URL: http://issues.apache.org/jira/browse/DERBY-754
>      Project: Derby
>         Type: Improvement
>   Components: Performance
>     Versions: 10.1.2.0, 10.1.2.1, 10.2.0.0
>     Reporter: A B
>     Assignee: A B
>  Attachments: d754_draft.patch
>
> In cases where a SELECT subquery occurs as one of the operands to a Join, it 
> is sometimes beneficial to push join predicates down to the subquery, which 
> allows the optimizer to find a better access path for the subquery and thus 
> can improve query performance.
> For example, take the following query:
> SELECT  t1.a, t1.b, temp.c
> FROM t1
>   LEFT OUTER JOIN (
>     SELECT c,d
>     FROM t2
>   ) as temp
>   ON
>     t1.a = temp.d
>     and temp.d = 8
> ;
> Currently, when optimizing the inner SELECT query, Derby will only pass the 
> inner SELECT's WHERE predicates to the optimizer--the outer ON predicates are 
> ignored.  Thus, in this case, the optimizer will have no predicates to work 
> with and so will do a table scan on t2.
> If, however, Derby were to push the "temp.d = 8" predicate down into the 
> inner SELECT query, the optimizer could use that predicate to make a smarter 
> decision.  For example, if a primary key existed on column "d" in T2, the 
> optimizer could use that and then choose to do an index/hash scan when 
> reading t2 (instead of a table scan).
> Not only does can this kind of "pushing" lead to faster reading of tables, 
> but in some cases where the predicate being pushed references two tables, it 
> can also influence the optimizer's choice of join strategy, which can in turn 
> lead to improved performance.

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators:
   http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see:
   http://www.atlassian.com/software/jira

Reply via email to