[ 
https://issues.apache.org/jira/browse/DERBY-6784?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250874#comment-14250874
 ] 

Mike Matrigali edited comment on DERBY-6784 at 12/18/14 12:29 AM:
------------------------------------------------------------------

My plan is to address this issue at the point where the IN LIST predicate is
costed.  The first patch I plan to submit will alter the costing of the
predicate list.  It currently multiplies the estimated cost of the single
predicate by the number of terms in the list.
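The current behavior described above can be sketched roughly as follows (a
minimal illustration with hypothetical names, not Derby's actual optimizer
API): every probe is charged the full single-predicate cost.

```java
// Hypothetical sketch of the current IN-list costing described above.
// Names are illustrative only; the real logic lives in Derby's optimizer.
public class InListCostSketch {

    /** Current behavior: charge every IN-list term a full probe cost. */
    static double currentMultiProbeCost(double singleProbeCost, int numTerms) {
        return singleProbeCost * numTerms;
    }

    public static void main(String[] args) {
        // e.g. 100 terms, each charged the full estimated probe cost of 5.0
        System.out.println(currentMultiProbeCost(5.0, 100)); // prints 500.0
    }
}
```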

Given Army's observation there is obviously a problem with the costing, which
may lie in over-costing the predicate plan, in under-costing the full table
merge plan, or possibly both.  I believe that simply multiplying does
over-cost the probe strategy.  The current costing model for scans that store
provides includes both "cached" and "non-cached" scan costs.  In the current
case it looks like the SQL layer requests a "non-cached" estimate for the
first probe, which is reasonable.  But I think it would be better to assume
that subsequent probes into the index are more accurately costed as "cached"
probes than as "uncached" probes.  The original measurements used for the raw
cost numbers show that cached probes are about 6 times faster than uncached
probes.  See the tables in BTreeCostController.java (we should update these
measurements at some point for faster machines and current code paths - but
that is a separate job).

So my first attempt at this is going to be to lower the cost of the
subsequent terms by this factor, see how the choice of multi-probe varies
with the number of terms vs. rows, and verify that performance is still
better.  Though I do wonder what the downside of just changing the system to
do Army's initial "hack" would be.  It seems likely that for most queries, if
there is an IN-LIST with a useful index, the optimizer should be picking that
index and doing probes.
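The proposed adjustment could be sketched like this (again a hypothetical
illustration, not Derby's actual code): charge the first probe at the
"uncached" rate and each subsequent probe at the "cached" rate, using the
roughly 6x cached-vs-uncached ratio from the measured tables in
BTreeCostController.java.

```java
// Hypothetical sketch of the proposed IN-list costing adjustment.
// The CACHED_FACTOR of 6 reflects the measured cached-vs-uncached ratio
// mentioned above; real values would come from BTreeCostController.java.
public class AdjustedInListCostSketch {

    static final double CACHED_FACTOR = 6.0;

    /** First probe at uncached cost, subsequent probes at cached cost. */
    static double adjustedMultiProbeCost(double uncachedProbeCost, int numTerms) {
        if (numTerms <= 0) {
            return 0.0;
        }
        double cachedProbeCost = uncachedProbeCost / CACHED_FACTOR;
        return uncachedProbeCost + (numTerms - 1) * cachedProbeCost;
    }

    public static void main(String[] args) {
        // 100 terms: one uncached probe (6.0) plus 99 cached probes (1.0 each)
        System.out.println(adjustedMultiProbeCost(6.0, 100)); // prints 105.0
    }
}
```

This lowers the estimated multi-probe cost substantially as the term count
grows, which should let the optimizer choose multi-probe in more of the cases
the issue describes.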



> change optimizer to choose in list multiprobe more often
> --------------------------------------------------------
>
>                 Key: DERBY-6784
>                 URL: https://issues.apache.org/jira/browse/DERBY-6784
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.11.1.1
>            Reporter: Mike Matrigali
>            Assignee: Mike Matrigali
>
> Using the  multi-probe join strategy is an obvious performance win when
> the optimizer chooses it.  There are cases currently where the costing 
> makes the optimizer choose other plans which do not perform as well as
> the multi-probe strategy.
> The class of queries that are affected are those where the number of terms
> in the IN LIST is large relative to the number of rows in the table, and there
> is a useful index to probe for the column that is referenced by the IN LIST.
> There are multiple benefits to choosing the multi-probe strategy, including
> the following:
> 1) Often better execution time, where the alternative is to do a full table
>     merge on the column.
> 2) The multi-probe strategy results in "pushing" the work into the store,
>      and this may result in more concurrent behavior (see DERBY-6300 and
>      DERBY-6301).  First, fewer rows may be locked by probing rather than
>      by a full table scan (in the worst case the same number, if the query
>      manages to probe every value in the table).  Second, depending on the
>      isolation level of the query, store will lock only matching rows,
>      while in the current implementation all rows that are returned by
>      store for qualification above store remain locked whether they
>      qualify or not.  Especially in small-table cases, other query plan
>      choices have been changed to favor probing indexes rather than full
>      table scans even if pure CPU cost favors the table scan.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
