Re: What is the correct way to solve DERBY-504?

Rick Hillegas Thu, 18 Aug 2005 10:25:30 -0700

Hi Knut,

Hm, I'm not sure I understand what you've done. Pardon me if I havegarbled your description.


It sounds as though the optimizer has done two things:

1) Rewritten the original query by flattening the inner subquery intothe outer SELECT.


2) Then tried to push the DISTINCT into the Store.

Perhaps the optimzer has made a flattening mistake here. It may haverewritten the query to be


SELECT DISTINCT name, id from names

when it meant to rewrite the query as

SELECT DISTINCT name from names

What I'm saying is, it't not clear to me that what's broken is DISTINCTpush-down. The bug may really be in the subquery flattening logic.

I agree with your misgivings that your proposed solution is overbroad. Ithink the bug has to do with the agreement between the outer DISTINCTlist and the inner SELECT list. They need to be identical. In this casethe outer DISTINCT list has one column but the inner SELECT list hastwo. I think that the number of tables in the inner FROM list should notbe relevant.


Regards,
-Rick



Knut Anders Hatlen wrote:

Hi,

I have found a solution to DERBY-504
(http://issues.apache.org/jira/browse/DERBY-504), but I would like to
get some input, as I'm not sure the solution is quite optimal.

The short version of the bug is that in queries like

 SELECT DISTINCT name FROM (SELECT name, id FROM names) AS n

duplicates are not removed from the result. (I don't think anyone
would/should ever write such a query, but it's still a bug...)

After some investigation I found out that the bug is caused by the
optimizer passing the distinct from the top level query down to the
subquery, thereby scanning the names table for DISTINCT(name, id)
instead of DISTINCT(name).

This particular optimization is meant for cases where the duplicate
elimination could be done during the scanning of the table, and a
comment in org/apache/derby/impl/sql/compile/SelectNode.java mentions
the criteria for using this optimization:

 /* See if we can push duplicate elimination into the store
  * via a hash scan.  This is possible iff:
  *        o  A single table query
  *        o  We haven't merged the order by and distinct sorts.
  *           (Results do not have to be in a particular order.)
  *        o  All entries in the select's RCL are ColumnReferences.
  *        o  No predicates (This is because we currently do not
  *           differentiate between columns referenced in the select
  *           list and columns referenced in other clauses.  In other
  *           words, the store will do duplicate elimination based on
  *           all referenced columns.)

For the single-table-query criterion, the code checks whether the
query has only one source in its FROM clause, and then uses the same
test recursively on that source until the criterion is not met or the
source is a table.

My bug fix works by disabling the recursive check. Instead of running
the test recursively, it just checks whether there is exactly one
source in the FROM clause and that the source is a table, not a
subquery.

Now, this fix works, which is probably most important, but I'm not
sure it is the best solution. At least, I still have some questions.

1. The fix removes optimization in some cases where it could be
   advantageous. It makes sure that the top-level distinct is never
   pushed down to a subquery. Is this an acceptable trade-off? An
   alternative to this would be to make the push more
   intelligent. Right now the distinct scan is pushed to the subquery
   through ResultSetNode.markForDistinctScan(), a method which takes
   no arguments. Perhaps markForDistinctScan() could have the list of
   distinct columns as an argument?

2. The query that exposed this bug is really a distinct select on a
   projection of a projection. A projection of a projection could
   easily be rewritten as a single projection. Should we extend the
   optimizer to handle such cases?

I would really like to hear your comments on these questions!

Thanks!

Re: What is the correct way to solve DERBY-504?

Reply via email to