Re: [PATCH] (DERBY-251) DISTINCT query is returning duplicate rows

Mamta Satoor Wed, 11 May 2005 20:29:09 -0700

Thanks for committing the patch. Also, the tests ran fine with no
failure with the patch.


Mamta

On 5/11/05, Satheesh Bandaram <[EMAIL PROTECTED]> wrote:
> Submitted this patch. Thanks for addressing the review comments quickly.
> 
> Satheesh
> 
> Sending       
> java\engine\org\apache\derby\impl\sql\compile\FromList.java
> Sending       
> java\testing\org\apache\derbyTesting\functionTests\master\distinctElimination.out
> Sending       
> java\testing\org\apache\derbyTesting\functionTests\tests\lang\distinctElimination.sql
> Transmitting file data ...
> Committed revision 169735.
> 
> Mamta Satoor wrote:
> 
> Hi Jack,

Appreciate you taking the time to do the review and catching
> the
predicate with expression. I have changed the code and also added a
test
> case for it. I have fired the derbyall suite on my codeline to
make sure
> everything else runs smoothly. Attached is the updated
> patch
anyways.

thanks,
Mamta

On 5/11/05, Jack Klebanoff
> <[EMAIL PROTECTED]> wrote:
 
> The patch does not completely fix the problem. It does not handle the
case
> where the exists table column is embedded in an expression. Try
> the
following variation on the test select:

select distinct q1."NO1" from
> IDEPT q1, IDEPT q2
where ( q2."DISCRIM_DEPT" = 'HardwareDept')
and (
> q1."DISCRIM_DEPT" = 'SoftwareDept') and ( q1."NO1" <> ALL
(select q3."NO1"
> from IDEPT q3 where ( ABS(q3."REPORTTO_NO") = q2."NO1")))

Because
> q3."REPORTTO_NO" is inside a call to ABS the code added
> to
FromList.returnsAtMostSingleRow does not see it.

I would suggest using

> JBitSet referencedTables =
and.getLeftOperand().getTablesReferenced();
 if(
> referencedTables.get( existsTableNumber))
> {

predicatesTemp.removeElementAt(predicatesTempIndex);
 break;
 }

instead
> of
 BinaryRelationalOperatorNode beon =
(BinaryRelationalOperatorNode)
> and.getLeftOperand();
 ValueNode left = beon.getLeftOperand();
 ValueNode
> right = beon.getRightOperand();

 /* If left or right side of predicate
> refer to
exists base table,
 then remove it */
 if ((left instanceof
> ColumnReference) &&
 ((ColumnReference) left).getTableNumber()
> ==
existsTableNumber)
> {

predicatesTemp.removeElementAt(predicatesTempIndex);
 break;
 }
 else if
> ((right instanceof ColumnReference) &&
 ((ColumnReference)
> right).getTableNumber()
== existsTableNumber)
> {

predicatesTemp.removeElementAt(predicatesTempIndex);
 break;
 }

I have
> tried it out and it seems to work.

Jack Klebanoff

Mamta Satoor wrote:

 
> Hi,

I have a patch for this optimizer bug. Basically, the issue turned
> out
to be the logic for DISTINCT elimination. During the optimization
phase,
> if a query has DISTINCT clause, then impl.sql.compile.FromList
class's
> returnsAtMostSingleRow() method gets called. This method
returns true if the
> method concludes that DISTINCT in the query is
redundant (based on a complex
> logic that decides that the query is
going to return distinct rows on its
> own without the DISTINCT clause.
The details of the current logic for
> DISTINCT elimination can be found
in the comments at the method level.)

For
> the query in question in this bug, the method returned true for
DISTINCT
> elimination which is wrong. The explanation is as follows.

First of all, I
> was able to simplify the query reported in the bug to
following
> query.
select distinct q1."NO1" from IDEPT q1, IDEPT q2
where (
> q2."DISCRIM_DEPT" = 'HardwareDept')
and ( q1."DISCRIM_DEPT" =
> 'SoftwareDept') and ( q1."NO1" <> ALL
(select q3."NO1" from IDEPT q3 where
> (q3."REPORTTO_NO" = q2."NO1")))

This query gets converted to following
> during optimization
select distinct q1."NO1" from IDEPT q1, IDEPT q2
where (
> q2."DISCRIM_DEPT" = 'HardwareDept')
and ( q1."DISCRIM_DEPT" =
> 'SoftwareDept') and not exists (
(select q3."NO1" from IDEPT q3 where
(
> q3."REPORTTO_NO" = q2."NO1" and q3."NO1" = q1."NO1") ) ) ;

This optimized
> query has 4 predicates associated with it
q3.reportto_no =
> q2.no1
q2.discrim_dept = 'HardwareDept'
q1.descrim_dept =
> 'SoftwareDept'
q1.no1 = q3.no1

Next, on this optimized query(since it has
> DISTINCT clause in it), the
returnsAtMostSingleRow() method gets called. The
> method incorrectly
returns true indicating that DISTINCT can be eliminated.
> The reason
for this is that method is looking at predicates that belong to
> the
inside query with the exists clause (which is on table IDEPT q3)
> to
determine DISTINCT elimination for the outer level.

The fix is that the
> predicates from the exists query, (in this
particular case, q3."NO1" =
> q1."NO1" and q3.reportto_no = q2.no1)
should not be considered when deciding
> elimination of DISTINCT in the
outer query. That is what the attached patch
> does.

Hope this helps understand the problem and the proposed fix for
> it.
The files impacted by the change are as follows
svn stat
M
> java\engine\org\apache\derby\impl\sql\compile\FromList.java
M
> java\testing\org\apache\derbyTesting\functionTests\tests\lang\distinctElimination.sql
M
> java\testing\org\apache\derbyTesting\functionTests\master\distinctElimination.out

Please
> send in comments you may have. I have run the existing tests
and the patch
> didn't cause any failures.

thanks,
Mamta


 
> >
 
> ________________________________
> 
Index:
> java/engine/org/apache/derby/impl/sql/compile/FromList.java
===================================================================
---
> java/engine/org/apache/derby/impl/sql/compile/FromList.java
> (revision 169642)
+++
> java/engine/org/apache/derby/impl/sql/compile/FromList.java
> (working copy)
@@ -1040,7 +1040,7 @@
 }
 
 /**
- * Decrement (query block)
> level (0-based) for 
+ * Decrement (query block) level (0-based) for
 * all
> of the tables in this from list.
 * This is useful when flattening a
> subquery.
 *
@@ -1069,7 +1069,7 @@
 }
 }
 
- 
+
 /**
 * This method is used
> for both subquery flattening and distinct
 * elimination based on a
> uniqueness condition. For subquery
@@ -1079,7 +1079,7 @@
 * any duplicates.
> * This is true if every table in the from list is
 * (a base table and the
> set of columns from the table that
- * are in equality comparisons with
> expressions that do not include columns 
+ * are in equality comparisons
> with expressions that do not include columns
 * from the same table is a
> superset of any unique index
 * on the table) or an EXISTS FBT. In addition,
> at least 1 of the tables
 * in the list has a set of columns in equality
> comparisons with expressions
@@ -1105,28 +1105,28 @@
 * create an array of
> columns from the table(eqOuterCol)
 * (this is used to determine that only
> one row will be returned
 * from a join)
- * 
+ *
 * if the current table is
> the table for the result columns
 * set the result columns in the eqOuterCol
> and tableColMap
 * (if these columns are a superset of a unique index and
 *
> all joining tables result in only one row, the
 * results will be distinct)
> * go through all the predicates and update tableColMap and
- * eqOuterCol
> with join columns and correlation variables, 
+ * eqOuterCol with join
> columns and correlation variables,
 * parameters and constants
 * since
> setting constants, correlation variables and parameters,
- * reduces the
> number of columns required for uniqueness in a 
+ * reduces the number of
> columns required for uniqueness in a
 * multi-column index, they are set for
> all the tables (if the
 * table is not the result table, in this case only
> the column of the
 * result table is set)
 * join columns are just updated
> for the column in the row of the
 * joining table.
- * 
- * check if the
> marked columns in tableColMap are a superset of a unique 
- * index 
+ *
+ *
> check if the marked columns in tableColMap are a superset of a unique
+ *
> index
 * (This means that the join will only produce 1 row when joined
 *
> with 1 row of another table)
- * check that there is a least one table for
> which the columns in 
+ * check that there is a least one table for which
> the columns in
 * eqOuterCol(i.e. constant values) are a superset of a
> unique index
 * (This quarantees that there will be only one row selected
 *
> from this table).
@@ -1134,8 +1134,8 @@
 * Once all tables have been
> evaluated, check that all the tables can be
 * joined by unique index or
> will have only one row
 *
- * 
 *
+ *
 * @param rcl If non-null, the RCL
> from the query block.
 * If non-null for subqueries, then entry can
 * be
> considered as part of an = comparison.
@@ -1150,8 +1150,8 @@
 *
 *
> @exception StandardException Thrown on error
 */
- boolean
> returnsAtMostSingleRow(ResultColumnList rcl, 
- ValueNode
> whereClause, 
+ boolean
> returnsAtMostSingleRow(ResultColumnList rcl,
+ ValueNode
> whereClause,
 PredicateList wherePredicates,
 DataDictionary dd)
 throws
> StandardException
@@ -1160,6 +1160,13 @@
 int[] tableNumbers;
> ColumnReference additionalCR = null;
 
+ PredicateList predicatesTemp;
+
> predicatesTemp = (PredicateList) getNodeFactory().getNode(
+
> C_NodeTypes.PREDICATE_LIST, getContextManager());
+ int wherePredicatesSize
> = wherePredicates.size();
+ for (int index = 0; index < wherePredicatesSize;
> index++)
+
> predicatesTemp.addPredicate((Predicate)wherePredicates.elementAt(index));
+
> /* When considering subquery flattening, we are interested
 * in the 1st
> (and only) entry in the RCL. (The RCL will be
 * null if result column is
> not of interest for subquery flattening.)
@@ -1193,6 +1200,77 @@
 {
 return
> false;
 }
+ FromBaseTable fbt = (FromBaseTable) prn.getChildResult();
+
> //Following for loop code is to take care of Derby-251 (DISTINCT returns
+
> //duplicate rows).
+ //Derby-251 returned duplicate rows because we were
> looking at predicates
+ //that belong to existsTable to determine DISTINCT
> elimination
+ //
+ //(Check method level comments to understand DISTINCT
> elimination rules.)
+ //
+ //For one specific example, consider the query
> below
+ //select distinct q1."NO1" from IDEPT q1, IDEPT q2
+ //where (
> q2."DISCRIM_DEPT" = 'HardwareDept')
+ //and ( q1."DISCRIM_DEPT" =
> 'SoftwareDept') and ( q1."NO1" <> ALL
+ //(select q3."NO1" from IDEPT q3
> where (q3."REPORTTO_NO" = q2."NO1")))
+ //(select q3."NO1" from IDEPT q3
> where ( ABS(q3."REPORTTO_NO") = q2."NO1")))
+ //
+ //Table IDEPT in the
> query above has a primary key defined on column "NO1"
+ //This query gets
> converted to following during optimization
+ //
+ //select distinct q1."NO1"
> from IDEPT q1, IDEPT q2
+ //where ( q2."DISCRIM_DEPT" = 'HardwareDept')
+
> //and ( q1."DISCRIM_DEPT" = 'SoftwareDept') and not exists (
+ //(select
> q3."NO1" from IDEPT q3 where
+ //( ( ABS(q3."REPORTTO_NO") = q2."NO1") and
> q3."NO1" = q1."NO1") ) ) ;
+ //
+ //For the optimized query above, Derby
> generates following predicates.
+ //ABS(q3.reportto_no) = q2.no1
+
> //q2.discrim_dept = 'HardwareDept'
+ //q1.descrim_dept = 'SoftwareDept'
+
> //q1.no1 = q3.no1
+ //The predicate ABS(q3."NO1") = q1."NO1" should not be
> considered when trying
+ //to determine if q1 in the outer query has
> equality comparisons. 
+ //Similarly, the predicate q3.reportto_no = q2.no1
> should not be
+ //considered when trying to determine if q2 in the outer
> query has
+ //equality comparisons. To achieve this, predicates based on
> exists base
+ //table q3 (the first and the last predicate) should be
> removed while
+ //evaluating outer query for uniqueness.
+ //
+ if
> (fbt.getExistsBaseTable())
+ {
+ int existsTableNumber =
> fbt.getTableNumber();
+ int predicatesTempSize = predicatesTemp.size();
+
> for (int predicatesTempIndex = predicatesTempSize-1;
+ predicatesTempIndex
> >= 0; predicatesTempIndex--)
+ {
+ AndNode topAndNode = (AndNode)
+
> ((Predicate)
> predicatesTemp.elementAt(predicatesTempIndex)).getAndNode();
+
+
> for (ValueNode whereWalker = topAndNode; whereWalker instanceof AndNode;
+
> whereWalker = ((AndNode) whereWalker).getRightOperand())
+ {
+ // See if
> this is a candidate =
+ AndNode and = (AndNode) whereWalker;
+
+ //we only
> need to worry about equality predicates because only those
+ //predicates
> are considered during DISTINCT elimination.
+ if
> (!and.getLeftOperand().isRelationalOperator() ||
+
> !(((RelationalOperator)(and.getLeftOperand())).getOperator() ==
+
> RelationalOperator.EQUALS_RELOP))
+ {
+ continue;
+ }
+
+ JBitSet
> referencedTables = and.getLeftOperand().getTablesReferenced();
+ if
> (referencedTables.get(existsTableNumber))
+ {
+
> predicatesTemp.removeElementAt(predicatesTempIndex);
+ break;
+ }
+ }
+ }
+
> }
 }
 
 /* Build an array of tableNumbers from this query block.
@@ -1245,7
> +1323,7 @@
 /* Now see if there are any equality conditions
 * of interest
> in the where predicates.
 */
-
> wherePredicates.checkTopPredicatesForEqualsConditions(
+
> predicatesTemp.checkTopPredicatesForEqualsConditions(
> tableNumber, eqOuterCols, tableNumbers,
 tableColMap[index],
> resultColTable);
 
@@ -1298,7 +1376,7 @@
 /* unique key join - exists tables
> already marked as 
 * 1 row - so don't need to look at them
 */
- if
> (!oneRow[i] && tableColMap[i][index].get(0)) 
+ if (!oneRow[i] &&
> tableColMap[i][index].get(0))
 {
 oneRow[i] = true;
 foundOneRow =
> true;
Index:
> java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
===================================================================
---
> java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
> (revision 169642)
+++
> java/testing/org/apache/derbyTesting/functionTests/tests/lang/distinctElimination.sql
> (working copy)
@@ -12,6 +12,11 @@
 create unique index three_c1 on
> three(c1);
 create table four(c1 int, c2 int, c3 int, c4 int, c5 int);
> create unique index four_c1c3 on four(c1, c3);
+CREATE TABLE "APP"."IDEPT"
> ("DISCRIM_DEPT" VARCHAR(32), "NO1" INTEGER NOT NULL, 
+"NAME" VARCHAR(50),
> "AUDITOR_NO" INTEGER, "REPORTTO_NO" INTEGER, "HARDWAREASSET"
+ VARCHAR(15),
> "SOFTWAREASSET" VARCHAR(15));
+-- primary/unique
+ALTER TABLE "APP"."IDEPT"
> ADD CONSTRAINT "PK_IDEPT" PRIMARY KEY ("NO1");
 
 insert into one values (1,
> 1, 1, 1, 1);
 insert into one values (2, 1, 1, 1, 1);
@@ -51,6 +56,12 @@
> insert into four values (3, 1, 2, 1, 1);
 insert into four values (3, 1, 3,
> 1, 1);
 
+insert into idept values ('Dept', 1, 'Department1', null, null,
> null, null);
+insert into idept values ('HardwareDept', 2, 'Department2',
> 25, 1, 'hardwareaset2', null);
+insert into idept values ('HardwareDept', 3,
> 'Department3', 25, 2, 'hardwareaset3', null);
+insert into idept values
> ('SoftwareDept', 4, 'Department4', 25, 1, null, 'softwareasset4');
+insert
> into idept values ('SoftwareDept', 5, 'Department5', 30, 4, null,
> 'softwareasset5');
+
 call
> SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
> maximumdisplaywidth 20000;
 
@@ -61,6 +72,24 @@
 -- Following runtime
> statistics output should have Distinct Scan in it
 values
> SYSCS_UTIL.SYSCS_GET_RUNTIMESTATISTICS();
 
+--Derby251
> Distinct should not get eliminated for following query
+--because there is
> no equality condition on unique column of table
+--in the outside
> query
+select distinct q1."NO1", q1."NAME", q1."AUDITOR_NO",
> 
+q1."REPORTTO_NO", q1."DISCRIM_DEPT", q1."SOFTWAREASSET" from 
+IDEPT q1,
> IDEPT q2 where ( q2."DISCRIM_DEPT" = 'HardwareDept') 
+ and (
> q1."DISCRIM_DEPT" = 'SoftwareDept') and ( q1."NO1" 
+<> ALL ( select
> q3."NO1" from IDEPT q3 where ( ( 
+q3."DISCRIM_DEPT" = 'Dept') or (
> q3."DISCRIM_DEPT" = 
+'HardwareDept') or ( q3."DISCRIM_DEPT" =
> 'SoftwareDept') ) 
+and ( q3."REPORTTO_NO" = q2."NO1") ) ) ;
+--
+--Another
> test case of Derby251 where the exists table column is embedded in an
> expression.
+select distinct q1."NO1" from IDEPT q1, IDEPT q2
+where (
> q2."DISCRIM_DEPT" = 'HardwareDept')
+and ( q1."DISCRIM_DEPT" =
> 'SoftwareDept') and ( q1."NO1" <> ALL
+(select q3."NO1" from IDEPT q3 where
> ( ABS(q3."REPORTTO_NO") = q2."NO1")));
+
 -- result ordering is not
> guaranteed, but order by clause will change how
 -- distinct is executed. So
> test by retrieving data into a temp table and
 -- return results ordered
> after making sure the query was executed as expected.
@@ -170,3 +199,4 @@
> drop table two;
 drop table three;
 drop table four;
+drop table
> idept;
Index:
> java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
===================================================================
---
> java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
> (revision 169642)
+++
> java/testing/org/apache/derbyTesting/functionTests/master/distinctElimination.out
> (working copy)
@@ -19,6 +19,13 @@
 0 rows inserted/updated/deleted
 ij>
> create unique index four_c1c3 on four(c1, c3);
 0 rows
> inserted/updated/deleted
+ij> CREATE TABLE "APP"."IDEPT" ("DISCRIM_DEPT"
> VARCHAR(32), "NO1" INTEGER NOT NULL, 
+"NAME" VARCHAR(50), "AUDITOR_NO"
> INTEGER, "REPORTTO_NO" INTEGER, "HARDWAREASSET"
+ VARCHAR(15),
> "SOFTWAREASSET" VARCHAR(15));
+0 rows inserted/updated/deleted
+ij> --
> primary/unique
+ALTER TABLE "APP"."IDEPT" ADD CONSTRAINT "PK_IDEPT" PRIMARY
> KEY ("NO1");
+0 rows inserted/updated/deleted
 ij> insert into one values
> (1, 1, 1, 1, 1);
 1 row inserted/updated/deleted
 ij> insert into one values
> (2, 1, 1, 1, 1);
@@ -87,6 +94,16 @@
 1 row inserted/updated/deleted
 ij>
> insert into four values (3, 1, 3, 1, 1);
 1 row
> inserted/updated/deleted
+ij> insert into idept values ('Dept', 1,
> 'Department1', null, null, null, null);
+1 row inserted/updated/deleted
+ij>
> insert into idept values ('HardwareDept', 2, 'Department2', 25, 1,
> 'hardwareaset2', null);
+1 row inserted/updated/deleted
+ij> insert into
> idept values ('HardwareDept', 3, 'Department3', 25, 2, 'hardwareaset3',
> null);
+1 row inserted/updated/deleted
+ij> insert into idept values
> ('SoftwareDept', 4, 'Department4', 25, 1, null, 'softwareasset4');
+1 row
> inserted/updated/deleted
+ij> insert into idept values ('SoftwareDept', 5,
> 'Department5', 30, 4, null, 'softwareasset5');
+1 row
> inserted/updated/deleted
 ij> call
> SYSCS_UTIL.SYSCS_SET_RUNTIMESTATISTICS(1);
 0 rows
> inserted/updated/deleted
 ij> maximumdisplaywidth 20000;
@@ -143,6 +160,31
> @@
 None
 next qualifiers:
 None
+ij> --Derby251 Distinct should not get
> eliminated for following query
+--because there is no equality condition on
> unique column of table
+--in the outside query
+select distinct q1."NO1",
> q1."NAME", q1."AUDITOR_NO", 
+q1."REPORTTO_NO", q1."DISCRIM_DEPT",
> q1."SOFTWAREASSET" from 
+IDEPT q1, IDEPT q2 where ( q2."DISCRIM_DEPT" =
> 'HardwareDept') 
+ and ( q1."DISCRIM_DEPT" = 'SoftwareDept') and ( q1."NO1"
> 
+<> ALL ( select q3."NO1" from IDEPT q3 where ( ( 
+q3."DISCRIM_DEPT" =
> 'Dept') or ( q3."DISCRIM_DEPT" = 
+'HardwareDept') or ( q3."DISCRIM_DEPT" =
> 'SoftwareDept') ) 
+and ( q3."REPORTTO_NO" = q2."NO1") ) ) ;
+NO1 |NAME
> |AUDITOR_NO |REPORTTO_NO|DISCRIM_DEPT |SOFTWAREASSET
> 
+---------------------------------------------------------------------------------------------------------------------------------------
+4
> |Department4 |25 |1 |SoftwareDept |softwareasset4 
+5 |Department5 |30 |4
> |SoftwareDept |softwareasset5 
+ij> --
+--Another test case of Derby251
> where the exists table column is embedded in an expression.
+select distinct
> q1."NO1" from IDEPT q1, IDEPT q2
+where ( q2."DISCRIM_DEPT" =
> 'HardwareDept')
+and ( q1."DISCRIM_DEPT" = 'SoftwareDept') and ( q1."NO1" <>
> ALL
+(select q3."NO1" from IDEPT q3 where ( ABS(q3."REPORTTO_NO") =
> q2."NO1")));
+NO1 
+-----------
+4 
+5 
 ij> -- result ordering is not
> guaranteed, but order by clause will change how
 -- distinct is executed. So
> test by retrieving data into a temp table and
 -- return results ordered
> after making sure the query was executed as expected.
@@ -2319,4 +2361,6 @@
> 0 rows inserted/updated/deleted
 ij> drop table four;
 0 rows
> inserted/updated/deleted
+ij> drop table idept;
+0 rows
> inserted/updated/deleted
 ij> 
 
>

Re: [PATCH] (DERBY-251) DISTINCT query is returning duplicate rows

Reply via email to