Re: [PATCH] (DERBY-251) DISTINCT query is returning duplicate rows

Mamta Satoor Wed, 11 May 2005 13:01:09 -0700

Hi Jack,

Appreciate you taking the time to do the review and catching the
predicate with expression. I have changed the code and also added a
test case for it. I have fired the derbyall suite on my codeline to
make sure everything else runs smoothly. Attached is the updated patch
anyways.


thanks,
Mamta

On 5/11/05, Jack Klebanoff <[EMAIL PROTECTED]> wrote:
> The patch does not completely fix the problem. It does not handle the
> case where the exists table column is embedded in an expression. Try the
> following variation on the test select:
> 
> select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
> where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
> and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" <> ALL
> (select  q3."NO1" from IDEPT q3 where  ( ABS(q3."REPORTTO_NO") =  q2."NO1")))
> 
> Because q3."REPORTTO_NO" is inside a call to ABS the code added to
> FromList.returnsAtMostSingleRow does not see it.
> 
> I would suggest using
> 
>                        JBitSet referencedTables =
> and.getLeftOperand().getTablesReferenced();
>                        if( referencedTables.get( existsTableNumber))
>                        {
> 
> predicatesTemp.removeElementAt(predicatesTempIndex);
>                            break;
>                        }
> 
> instead of
>                        BinaryRelationalOperatorNode beon =
> (BinaryRelationalOperatorNode)
>                            and.getLeftOperand();
>                        ValueNode left = beon.getLeftOperand();
>                        ValueNode right = beon.getRightOperand();
> 
>                        /* If left or right side of predicate refer to
> exists base table,
>                        then remove it */
>                        if ((left instanceof ColumnReference) &&
>                            ((ColumnReference) left).getTableNumber() ==
> existsTableNumber)
>                        {
> 
> predicatesTemp.removeElementAt(predicatesTempIndex);
>                            break;
>                        }
>                        else if ((right instanceof ColumnReference) &&
>                            ((ColumnReference) right).getTableNumber()
> == existsTableNumber)
>                        {
> 
> predicatesTemp.removeElementAt(predicatesTempIndex);
>                            break;
>                        }
> 
> I have tried it out and it seems to work.
> 
> Jack Klebanoff
> 
> Mamta Satoor wrote:
> 
> >Hi,
> >
> >I have a patch for this optimizer bug. Basically, the issue turned out
> >to be the logic for DISTINCT elimination. During the optimization
> >phase, if a query has DISTINCT clause, then impl.sql.compile.FromList
> >class's returnsAtMostSingleRow() method gets called. This method
> >returns true if the method concludes that DISTINCT in the query is
> >redundant (based on a complex logic that decides that the query is
> >going to return distinct rows on its own without the DISTINCT clause.
> >The details of the current logic for DISTINCT elimination can be found
> >in the comments at the method level.)
> >
> >For the query in question in this bug, the method returned true for
> >DISTINCT elimination which is wrong. The explanation is as follows.
> >
> >First of all, I was able to simplify the query reported in the bug to
> >following query.
> >select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
> >where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
> >and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" <> ALL
> >(select  q3."NO1" from IDEPT q3 where  (q3."REPORTTO_NO" =  q2."NO1")))
> >
> >This query gets converted to following during optimization
> >select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
> >where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
> >and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  not exists (
> >(select  q3."NO1" from IDEPT q3 where
> >(  q3."REPORTTO_NO" =  q2."NO1"  and q3."NO1" = q1."NO1") ) )  ;
> >
> >This optimized query has 4 predicates associated with it
> >q3.reportto_no = q2.no1
> >q2.discrim_dept = 'HardwareDept'
> >q1.descrim_dept = 'SoftwareDept'
> >q1.no1 = q3.no1
> >
> >Next, on this optimized query(since it has DISTINCT clause in it), the
> >returnsAtMostSingleRow() method gets called. The method incorrectly
> >returns true indicating that DISTINCT can be eliminated. The reason
> >for this is that method is looking at predicates that belong to the
> >inside query with the exists clause (which is on table IDEPT q3) to
> >determine DISTINCT elimination for the outer level.
> >
> >The fix is that the predicates from the exists query, (in this
> >particular case, q3."NO1" = q1."NO1" and q3.reportto_no = q2.no1)
> >should not be considered when deciding elimination of DISTINCT in the
> >outer query. That is what the attached patch does.
> >
> >Hope this helps understand the problem and the proposed fix for it.
> >The files impacted by the change are as follows
> >svn stat
> >M java\engine\org\apache\derby\impl\sql\compile\FromList.java
> >M 
> >java\testing\org\apache\derbyTesting\functionTests\tests\lang\distinctElimination.sql
> >M 
> >java\testing\org\apache\derbyTesting\functionTests\master\distinctElimination.out
> >
> >Please send in comments you may have. I have run the existing tests
> >and the patch didn't cause any failures.
> >
> >thanks,
> >Mamta
> >
> >
> 
>

Index: java/engine/org/apache/derby/impl/sql/compile/FromList.java
===================================================================
--- java/engine/org/apache/derby/impl/sql/compile/FromList.java (revision 
169642)
+++ java/engine/org/apache/derby/impl/sql/compile/FromList.java (working copy)
@@ -1040,7 +1040,7 @@
        }
 
        /**
-        * Decrement (query block) level (0-based) for 
+        * Decrement (query block) level (0-based) for
         * all of the tables in this from list.
         * This is useful when flattening a subquery.
         *
@@ -1069,7 +1069,7 @@
                }
        }
 
-       
+
        /**
         * This method is used for both subquery flattening and distinct
         * elimination based on a uniqueness condition.  For subquery
@@ -1079,7 +1079,7 @@
         * any duplicates.
         * This is true if every table in the from list is
         * (a base table and the set of columns from the table that
-        * are in equality comparisons with expressions that do not include 
columns 
+        * are in equality comparisons with expressions that do not include 
columns
         * from the same table is a superset of any unique index
         * on the table) or an EXISTS FBT.  In addition, at least 1 of the 
tables
         * in the list has a set of columns in equality comparisons with 
expressions
@@ -1105,28 +1105,28 @@
         *              create an array of columns from the table(eqOuterCol)
         *              (this is used to determine that only one row will be 
returned
         *              from a join)
-        *                      
+        *
         *              if the current table is the table for the result columns
         *                      set the result columns in the eqOuterCol and 
tableColMap
         *                      (if these columns are a superset of a unique 
index and
         *                      all joining tables result in only one row, the
         *                      results will be distinct)
         *              go through all the predicates and update tableColMap  
and
-        *              eqOuterCol with join columns and correlation variables, 
+        *              eqOuterCol with join columns and correlation variables,
         *              parameters and constants
         *              since setting constants, correlation variables and 
parameters,
-        *              reduces the number of columns required for uniqueness 
in a 
+        *              reduces the number of columns required for uniqueness 
in a
         *              multi-column index, they are set for all the tables (if 
the
         *              table is not the result table, in this case only the 
column of the
      *         result table is set)
         *              join columns are just updated for the column in the row 
of the
         *              joining table.
-        *              
-        *              check if the marked columns in tableColMap are a 
superset of a unique 
-        *                      index           
+        *
+        *              check if the marked columns in tableColMap are a 
superset of a unique
+        *                      index
         *                      (This means that the join will only produce 1 
row when joined
         *                      with 1 row of another table)
-        *              check that there is a least one table for which the 
columns in 
+        *              check that there is a least one table for which the 
columns in
         *                      eqOuterCol(i.e. constant values) are a superset 
of a unique index
         *                      (This quarantees that there will be only one 
row selected
         *                      from this table).
@@ -1134,8 +1134,8 @@
         *      Once all tables have been evaluated, check that all the tables 
can be
         *      joined by unique index or will have only one row
         *
-        *      
         *
+        *
         * @param rcl                           If non-null, the RCL from the 
query block.
         *                                                      If non-null for 
subqueries, then entry can
         *                                                      be considered 
as part of an = comparison.
@@ -1150,8 +1150,8 @@
         *
         * @exception StandardException         Thrown on error
         */
-       boolean returnsAtMostSingleRow(ResultColumnList rcl, 
-                                                                  ValueNode 
whereClause, 
+       boolean returnsAtMostSingleRow(ResultColumnList rcl,
+                                                                  ValueNode 
whereClause,
                                                                   
PredicateList wherePredicates,
                                                                   
DataDictionary dd)
                throws StandardException
@@ -1160,6 +1160,13 @@
                int[]                   tableNumbers;
                ColumnReference additionalCR = null;
 
+               PredicateList predicatesTemp;
+               predicatesTemp = (PredicateList) getNodeFactory().getNode(
+                       C_NodeTypes.PREDICATE_LIST,     getContextManager());
+               int wherePredicatesSize = wherePredicates.size();
+               for (int index = 0; index < wherePredicatesSize; index++)
+                       
predicatesTemp.addPredicate((Predicate)wherePredicates.elementAt(index));
+
                /* When considering subquery flattening, we are interested
                 * in the 1st (and only) entry in the RCL.  (The RCL will be
                 * null if result column is not of interest for subquery 
flattening.)
@@ -1193,6 +1200,77 @@
                        {
                                return false;
                        }
+                       FromBaseTable fbt = (FromBaseTable) 
prn.getChildResult();
+                       //Following for loop code is to take care of Derby-251 
(DISTINCT returns
+                       //duplicate rows).
+                       //Derby-251 returned duplicate rows because we were 
looking at predicates
+                       //that belong to existsTable to determine DISTINCT 
elimination
+                       //
+                       //(Check method level comments to understand DISTINCT 
elimination rules.)
+                       //
+                       //For one specific example, consider the query below
+                       //select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
+                       //where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
+                       //and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( 
q1."NO1" <> ALL
+                       //(select  q3."NO1" from IDEPT q3 where  
(q3."REPORTTO_NO" =  q2."NO1")))
+                       //(select  q3."NO1" from IDEPT q3 where  ( 
ABS(q3."REPORTTO_NO") =  q2."NO1")))
+                       //
+                       //Table IDEPT in the query above has a primary key 
defined on column "NO1"
+                       //This query gets converted to following during 
optimization
+                       //
+                       //select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
+                       //where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
+                       //and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  not 
exists (
+                       //(select  q3."NO1" from IDEPT q3 where
+                       //(  ( ABS(q3."REPORTTO_NO") =  q2."NO1")  and q3."NO1" 
= q1."NO1") ) )  ;
+                       //
+                       //For the optimized query above, Derby generates 
following predicates.
+                       //ABS(q3.reportto_no) = q2.no1
+                       //q2.discrim_dept = 'HardwareDept'
+                       //q1.descrim_dept = 'SoftwareDept'
+                       //q1.no1 = q3.no1
+                       //The predicate ABS(q3."NO1") = q1."NO1" should not be 
considered when trying
+                       //to determine if q1 in the outer query has equality 
comparisons. 
+                       //Similarly, the predicate q3.reportto_no = q2.no1 
should not be
+                       //considered when trying to determine if q2 in the 
outer query has
+                       //equality comparisons. To achieve this, predicates 
based on exists base
+                       //table q3 (the first and the last predicate) should be 
removed while
+                       //evaluating outer query for uniqueness.
+                       //
+                       if (fbt.getExistsBaseTable())
+                       {
+                               int existsTableNumber = fbt.getTableNumber();
+                               int predicatesTempSize = predicatesTemp.size();
+                               for (int predicatesTempIndex = 
predicatesTempSize-1;
+                                       predicatesTempIndex >= 0; 
predicatesTempIndex--)
+                               {
+                                       AndNode topAndNode = (AndNode)
+                                               ((Predicate) 
predicatesTemp.elementAt(predicatesTempIndex)).getAndNode();
+
+                                       for (ValueNode whereWalker = 
topAndNode; whereWalker instanceof AndNode;
+                                               whereWalker = ((AndNode) 
whereWalker).getRightOperand())
+                                       {
+                                               // See if this is a candidate =
+                                               AndNode and = (AndNode) 
whereWalker;
+
+                                               //we only need to worry about 
equality predicates because only those
+                                               //predicates are considered 
during DISTINCT elimination.
+                                               if 
(!and.getLeftOperand().isRelationalOperator() ||
+                                                       
!(((RelationalOperator)(and.getLeftOperand())).getOperator() ==
+                                                       
RelationalOperator.EQUALS_RELOP))
+                                               {
+                                                       continue;
+                                               }
+
+                                               JBitSet referencedTables = 
and.getLeftOperand().getTablesReferenced();
+                                               if 
(referencedTables.get(existsTableNumber))
+                                               {
+                                                       
predicatesTemp.removeElementAt(predicatesTempIndex);
+                                                       break;
+                                               }
+                                       }
+                               }
+                       }
                }
 
                /* Build an array of tableNumbers from this query block.
@@ -1245,7 +1323,7 @@
                        /* Now see if there are any equality conditions
                         * of interest in the where predicates.
                         */
-                       wherePredicates.checkTopPredicatesForEqualsConditions(
+                       predicatesTemp.checkTopPredicatesForEqualsConditions(
                                                                tableNumber, 
eqOuterCols, tableNumbers,
                                                                
tableColMap[index], resultColTable);
 
@@ -1298,7 +1376,7 @@
                                                        /* unique key join - 
exists tables already marked as 
                                                         * 1 row - so don't 
need to look at them
                                                         */
-                                                       if (!oneRow[i] && 
tableColMap[i][index].get(0)) 
+                                                       if (!oneRow[i] && 
tableColMap[i][index].get(0))
                                                        {
                                                                oneRow[i] = 
true;
                                                                foundOneRow = 
true;
Index: 
java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
===================================================================
--- 
java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
       (revision 169642)
+++ 
java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
       (working copy)
@@ -12,6 +12,11 @@
 create unique index three_c1 on three(c1);
 create table four(c1 int, c2 int, c3 int, c4 int, c5 int);
 create unique index four_c1c3 on four(c1, c3);
+CREATE TABLE "APP"."IDEPT" ("DISCRIM_DEPT" VARCHAR(32), "NO1" INTEGER NOT 
NULL, 
+"NAME" VARCHAR(50), "AUDITOR_NO" INTEGER, "REPORTTO_NO" INTEGER, 
"HARDWAREASSET"
+ VARCHAR(15), "SOFTWAREASSET" VARCHAR(15));
+-- primary/unique
+ALTER TABLE "APP"."IDEPT" ADD CONSTRAINT "PK_IDEPT" PRIMARY KEY ("NO1");
 
 insert into one values (1, 1, 1, 1, 1);
 insert into one values (2, 1, 1, 1, 1);
@@ -51,6 +56,12 @@
 insert into four values (3, 1, 2, 1, 1);
 insert into four values (3, 1, 3, 1, 1);
 
+insert into idept values ('Dept', 1, 'Department1', null, null, null, null);
+insert into idept values ('HardwareDept', 2, 'Department2', 25, 1, 
'hardwareaset2', null);
+insert into idept values ('HardwareDept', 3, 'Department3', 25, 2, 
'hardwareaset3', null);
+insert into idept values ('SoftwareDept', 4, 'Department4', 25, 1, null, 
'softwareasset4');
+insert into idept values ('SoftwareDept', 5, 'Department5', 30, 4, null, 
'softwareasset5');
+
 call SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
 maximumdisplaywidth 20000;
 
@@ -61,6 +72,24 @@
 -- Following runtime statistics output should have Distinct Scan in it
 values SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
 
+--Derby251 Distinct should not get eliminated for following query
+--because there is no equality condition on unique column of table
+--in the outside query
+select  distinct  q1."NO1",  q1."NAME",  q1."AUDITOR_NO",  
+q1."REPORTTO_NO",  q1."DISCRIM_DEPT",  q1."SOFTWAREASSET" from 
+IDEPT q1, IDEPT q2 where  ( q2."DISCRIM_DEPT" = 'HardwareDept') 
+ and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" 
+<> ALL  ( select  q3."NO1" from IDEPT q3 where  ( ( 
+q3."DISCRIM_DEPT" = 'Dept')  or  ( q3."DISCRIM_DEPT" = 
+'HardwareDept')  or  ( q3."DISCRIM_DEPT" = 'SoftwareDept') )  
+and  ( q3."REPORTTO_NO" =  q2."NO1") ) )  ;
+--
+--Another test case of Derby251 where the exists table column is embedded in 
an expression.
+select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
+where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
+and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" <> ALL
+(select  q3."NO1" from IDEPT q3 where  ( ABS(q3."REPORTTO_NO") =  q2."NO1")));
+
 -- result ordering is not guaranteed, but order by clause will change how
 -- distinct is executed.  So test by retrieving data into a temp table and
 -- return results ordered after making sure the query was executed as expected.
@@ -170,3 +199,4 @@
 drop table two;
 drop table three;
 drop table four;
+drop table idept;
Index: 
java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
===================================================================
--- 
java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
   (revision 169642)
+++ 
java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
   (working copy)
@@ -19,6 +19,13 @@
 0 rows inserted/updated/deleted
 ij> create unique index four_c1c3 on four(c1, c3);
 0 rows inserted/updated/deleted
+ij> CREATE TABLE "APP"."IDEPT" ("DISCRIM_DEPT" VARCHAR(32), "NO1" INTEGER NOT 
NULL, 
+"NAME" VARCHAR(50), "AUDITOR_NO" INTEGER, "REPORTTO_NO" INTEGER, 
"HARDWAREASSET"
+ VARCHAR(15), "SOFTWAREASSET" VARCHAR(15));
+0 rows inserted/updated/deleted
+ij> -- primary/unique
+ALTER TABLE "APP"."IDEPT" ADD CONSTRAINT "PK_IDEPT" PRIMARY KEY ("NO1");
+0 rows inserted/updated/deleted
 ij> insert into one values (1, 1, 1, 1, 1);
 1 row inserted/updated/deleted
 ij> insert into one values (2, 1, 1, 1, 1);
@@ -87,6 +94,16 @@
 1 row inserted/updated/deleted
 ij> insert into four values (3, 1, 3, 1, 1);
 1 row inserted/updated/deleted
+ij> insert into idept values ('Dept', 1, 'Department1', null, null, null, 
null);
+1 row inserted/updated/deleted
+ij> insert into idept values ('HardwareDept', 2, 'Department2', 25, 1, 
'hardwareaset2', null);
+1 row inserted/updated/deleted
+ij> insert into idept values ('HardwareDept', 3, 'Department3', 25, 2, 
'hardwareaset3', null);
+1 row inserted/updated/deleted
+ij> insert into idept values ('SoftwareDept', 4, 'Department4', 25, 1, null, 
'softwareasset4');
+1 row inserted/updated/deleted
+ij> insert into idept values ('SoftwareDept', 5, 'Department5', 30, 4, null, 
'softwareasset5');
+1 row inserted/updated/deleted
 ij> call SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
 0 rows inserted/updated/deleted
 ij> maximumdisplaywidth 20000;
@@ -143,6 +160,31 @@
 None
        next qualifiers:
 None
+ij> --Derby251 Distinct should not get eliminated for following query
+--because there is no equality condition on unique column of table
+--in the outside query
+select  distinct  q1."NO1",  q1."NAME",  q1."AUDITOR_NO",  
+q1."REPORTTO_NO",  q1."DISCRIM_DEPT",  q1."SOFTWAREASSET" from 
+IDEPT q1, IDEPT q2 where  ( q2."DISCRIM_DEPT" = 'HardwareDept') 
+ and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" 
+<> ALL  ( select  q3."NO1" from IDEPT q3 where  ( ( 
+q3."DISCRIM_DEPT" = 'Dept')  or  ( q3."DISCRIM_DEPT" = 
+'HardwareDept')  or  ( q3."DISCRIM_DEPT" = 'SoftwareDept') )  
+and  ( q3."REPORTTO_NO" =  q2."NO1") ) )  ;
+NO1        |NAME                                              |AUDITOR_NO 
|REPORTTO_NO|DISCRIM_DEPT                    |SOFTWAREASSET  
+---------------------------------------------------------------------------------------------------------------------------------------
+4          |Department4                                       |25         |1   
       |SoftwareDept                    |softwareasset4 
+5          |Department5                                       |30         |4   
       |SoftwareDept                    |softwareasset5 
+ij> --
+--Another test case of Derby251 where the exists table column is embedded in 
an expression.
+select  distinct  q1."NO1" from IDEPT q1, IDEPT q2
+where  ( q2."DISCRIM_DEPT" = 'HardwareDept')
+and  ( q1."DISCRIM_DEPT" = 'SoftwareDept')  and  ( q1."NO1" <> ALL
+(select  q3."NO1" from IDEPT q3 where  ( ABS(q3."REPORTTO_NO") =  q2."NO1")));
+NO1        
+-----------
+4          
+5          
 ij> -- result ordering is not guaranteed, but order by clause will change how
 -- distinct is executed.  So test by retrieving data into a temp table and
 -- return results ordered after making sure the query was executed as expected.
@@ -2319,4 +2361,6 @@
 0 rows inserted/updated/deleted
 ij> drop table four;
 0 rows inserted/updated/deleted
+ij> drop table idept;
+0 rows inserted/updated/deleted
 ij>

Re: [PATCH] (DERBY-251) DISTINCT query is returning duplicate rows

Reply via email to