[jira] Commented: (DERBY-47) Some possible improvements to IN optimization

A B (JIRA) Tue, 13 Mar 2007 13:00:40 -0800

    [ 
https://issues.apache.org/jira/browse/DERBY-47?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12480546
 ]


A B commented on DERBY-47:
--------------------------

> 1) In one of the comments, you said:
> 
>    We intentionally use a parameter node instead of a constant node 
>    because the IN-list has more than one value 
> 
>    It's not *always* true that the IN-list has more than one value, right?
>    That is, it would be legal, if not very useful, to write
> 
>    SELECT * FROM t WHERE c IN (1)

Yes, an IN-list can only have a single value.  However, if such an IN-list 
occurs we will convert it into an equality predicate as part of the first "if" 
branch in the preprocess() method:

    /* Check for the degenerate case of a single element in the IN list.
     * If found, then convert to "=".
     */
    if (rightOperandList.size() == 1)
    ...

Thus we won't ever get to the code referenced above.  But you're right, it 
might be good to add an explanatory comment to explain this.

> 2) Do you have any test cases of the form
> 
>    WHERE c IN (SELECT c_prime from t_prime)
> 
>    That is, where the IN-list is neither a list of literal constants, nor a
>    list of parameter markers, but is instead a subquery whose values will
>    be used as the list.

A few test cases for this kind of query already exist in lang/inbetween.sql and 
also in lang/subquery.sql.  So I did not add any new test cases with my 
changes.  Is there anything in particular you think could be problematic here?

>    Does such a query generate and use the new style Multi-Probe processing?

No, it does not.  Multi-probe processing only occurs if the IN-list is solely 
comprised of constant and/or parameter nodes.  A subquery is neither constant 
nor parameter, hence no multi-probing will occur.  The code for this is in the 
preprocess() method of InListOperatorNode:

   else if ((leftOperand instanceof ColumnReference) &&
     rightOperandList.containsOnlyConstantAndParamNodes())
   {
        <create probe predicate>
    }

> 3) Do you have any test cases in which the IN-list predicate references a
>    column in a UNION or a UNION view, thus requiring pushing the IN-list
>    predicate down and pulling it back up?

I added a few, simplified test queries for this situation to lang/inbetween.sql 
as part of d47_mp_addlTestCases.patch (under the heading "Nested queries with 
unions and top-level IN list").  There are also a handful of queries in 
lang/predicatePushdown.sql that include IN lists in addition to equality 
predicates.  Feel free to comment if you think more testing should be done 
here...

> 4) In the change to OrNode.java, my eye was caught by the variable name 
> "beon".
>    I've seen that a common convention is to use an acronym, so maybe "beon"
>    stood for Binary Equality Operator Node, or something like that, but the
>    actual datatype is Binary*Relational*OperatorNode, so I would have
>    expected to see a variable "bron".
> 
>    BinaryRelationalOperatorNode beon = null;
> 
>    It's totally unimportant, but I saw it and thought I'd mention it.

Oops, good catch :)  Just a typo, I will try to fix this up.

> 5) I looked at the updated masters; to my eye they show that the new probe
>    predicate is working, and the optimizer is choosing to use index-to-base
>    processing for these predicates rather than the formerly-chosen table
>    scans, so this looks great to me.

Thank you for taking a look at these--this was an unexpected but very welcome 
review!

> 7) For some reason, I expected to see something more vivid indicating the
>    use of the new execution strategy in the query plans, I thought maybe I'd
>    see something like "MultiProbeTableScanResultSet" in query dumps? Is it
>    just these tests that don't show that, and other query dumps would? 

This had occurred to me, as well, but then I realized that in all of the test 
cases the MultiProbingTableScanResultSet shows up as the child of an 
IndexToBaseRowResultSet, which I think means it doesn't actually get printed in 
the query plans (only the info for the IndexToBaseRowResultSet is printed).  
This is simliar, I think, to BulkTableScanResultSet, which is not printed in 
query plans, either (I don't think...I believe the only way to tell is by 
looking at the fetch size).

> Or is the only indication that the new probing code has been chosen the
> use of the index in place of the table scan?

Right now I think the only way to tell is to look at the use of index *and* the 
number of rows visited/qualified: if we used an index but did not do 
multi-probing, we will probably see far more rows visited than if we used 
multi-probing (that won't always be the case, but is generally true).  Perhaps 
work to add an explicit indication of multi-probing to the query plan can be 
handled as a separate enhancement?

These are great observations and great questions--thank you for posing them!  
And if you have any others, please continue to ask.  I'm definitely grateful 
for the feedback...

> Some possible improvements to IN optimization
> ---------------------------------------------
>
>                 Key: DERBY-47
>                 URL: https://issues.apache.org/jira/browse/DERBY-47
>             Project: Derby
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 10.0.2.0
>         Environment: all
>            Reporter: Sunitha Kambhampati
>         Assigned To: A B
>         Attachments: d47_engine_doNotCommit_v1.patch, 
> d47_engine_doNotCommit_v1.stat, d47_mp_addlTestCases.patch, 
> d47_mp_CBO_MoAP_v1.patch, d47_mp_CBO_MoAP_v1.stat, d47_mp_codeGen_v1.patch, 
> d47_mp_codeGen_v1.stat, d47_mp_exec_v1.patch, d47_mp_exec_v1.stat, 
> d47_mp_masters_v1.patch, d47_mp_preprocess_v1.patch, 
> d47_mp_preprocess_v1.stat, d47_mp_relOpPredCheck_v1.patch, 
> d47_mp_relOpPredCheck_v1.stat, derby-47-performance-data.txt, 
> derby-47-performance-data.txt, Derby47PerformanceTest.java, 
> Derby47PerformanceTest.java, InListOperatorNode.java, 
> QueryPlanUniqueIndexAndWordIndexOneTerm.txt, 
> QueryPlanUniqueIndexAndWordIndexTwoTerms.txt, 
> QueryPlanUniqueIndexOnlyOneTerm.txt, QueryPlanUniqueIndexOnlyTwoTerms.txt, 
> readlocks.diff, readlocks_withContext.diff, readlocks_withContext.diff
>
>
> Consider a simple case of  - 
> A table tbl has 10000 rows, there is a primary key index on i1
> and the query in question is 
>  select * from tbl where i1 in (-1,100000)
> derby does a table scan of the entire table even though the "IN" list has 
> only two values and the comparison is on a field that has an index.
> Briefly looking at the code, it seems like we insert a between and use the IN 
> list to get the start and stop values for the scan. Thus the range of the 
> values in the "IN" list here plays an important role. 
> Thus if the query was changed to select * from tbl where i1 in (-1, 1), an 
> index scan would be chosen.
> It would be nice if we could do something clever in this case where there is 
> clearly an index on the field and the number of values in the IN list is 
> known. Maybe use the rowcount estimate and the IN list size to do some 
> optimizations.  
> - consider the length of the "IN" list to do searches on the table.  ie use 
> the IN list values to do index key searches on the table,
> -or try to convert it to a join. Use the "IN" list values to create a 
> temporary table and do a join. It is most likely that the optimizer will 
> choose the table with "IN" list here as the outer table in the join and thus 
> will do key searches on the larger table. 
> -------------------------------------------------------------------
> some query plans that I logged using derby.language.logQueryPlan=true for 
> some similar queries:
> Table has ascending values from 0 - 9999 for i1. primary key index on i1.
> GMT Thread[UT0,5,main] (XID = 19941), (SESSIONID = 0), select * from 
> scanfixed where i1 in (-1,9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) 
> ******* Project-Restrict ResultSet (2):
> Number of opens = 1
> Rows seen = 10000
> Rows filtered = 9990
> restriction = true
> projection = false
>       constructor time (milliseconds) = 0
>       open time (milliseconds) = 0
>       next time (milliseconds) = 0
>       close time (milliseconds) = 0
>       restriction time (milliseconds) = 0
>       projection time (milliseconds) = 0
>       optimizer estimated row count:          750.38
>       optimizer estimated cost:         8579.46
> Source result set:
>       Table Scan ResultSet for SCANFIXED at read committed isolation level 
> using instantaneous share row locking chosen by the optimizer
>       Number of opens = 1
>       Rows seen = 10000
>       Rows filtered = 0
>       Fetch Size = 16
>               constructor time (milliseconds) = 0
>               open time (milliseconds) = 0
>               next time (milliseconds) = 0
>               close time (milliseconds) = 0
>               next time in milliseconds/row = 0
>       scan information: 
>               Bit set of columns fetched=All
>               Number of columns fetched=9
>               Number of pages visited=417
>               Number of rows qualified=10000
>               Number of rows visited=10000
>               Scan type=heap
>               start position: 
> null          stop position: 
> null          qualifiers:
> Column[0][0] Id: 0
> Operator: <=
> Ordered nulls: false
> Unknown return value: false
> Negate comparison result: false
> Column[0][1] Id: 0
> Operator: <
> Ordered nulls: false
> Unknown return value: true
> Negate comparison result: true
>               optimizer estimated row count:          750.38
>               optimizer estimated cost:         8579.46
> ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------
> l
> 2004-10-14 18:59:47.577 GMT Thread[UT0,5,main] (XID = 19216), (SESSIONID = 
> 0), select * from scanfixed where i1 in 
> (9999,9998,9997,9996,9995,9994,9993,9992,9991,9990) ******* Project-Restrict 
> ResultSet (3):
> Number of opens = 1
> Rows seen = 10
> Rows filtered = 0
> restriction = true
> projection = true
>       constructor time (milliseconds) = 0
>       open time (milliseconds) = 0
>       next time (milliseconds) = 0
>       close time (milliseconds) = 0
>       restriction time (milliseconds) = 0
>       projection time (milliseconds) = 0
>       optimizer estimated row count:            4.80
>       optimizer estimated cost:           39.53
> Source result set:
>       Index Row to Base Row ResultSet for SCANFIXED:
>       Number of opens = 1
>       Rows seen = 10
>       Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6, 7, 8}
>               constructor time (milliseconds) = 0
>               open time (milliseconds) = 0
>               next time (milliseconds) = 0
>               close time (milliseconds) = 0
>               optimizer estimated row count:            4.80
>               optimizer estimated cost:           39.53
>               Index Scan ResultSet for SCANFIXED using index SCANFIXEDX at 
> read committed isolation level using instantaneous share row locking chosen 
> by the optimizer
>               Number of opens = 1
>               Rows seen = 10
>               Rows filtered = 0
>               Fetch Size = 16
>                       constructor time (milliseconds) = 0
>                       open time (milliseconds) = 0
>                       next time (milliseconds) = 0
>                       close time (milliseconds) = 0
>                       next time in milliseconds/row = 0
>               scan information: 
>                       Bit set of columns fetched=All
>                       Number of columns fetched=2
>                       Number of deleted rows visited=0
>                       Number of pages visited=2
>                       Number of rows qualified=10
>                       Number of rows visited=10
>                       Scan type=btree
>                       Tree height=2
>                       start position: 
>       >= on first 1 column(s).
>       Ordered null semantics on the following columns: 
>                       stop position: 
>       > on first 1 column(s).
>       Ordered null semantics on the following columns: 
>                       qualifiers:
> None
>                       optimizer estimated row count:            4.80
>                       optimizer estimated cost:           39.53

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (DERBY-47) Some possible improvements to IN optimization

Reply via email to