Hi Knut,

Hm, I'm not sure I understand what you've done. Pardon me if I have garbled your description.

It sounds as though the optimizer has done two things:

1) Rewritten the original query by flattening the inner subquery into the outer SELECT.

2) Then tried to push the DISTINCT into the Store.

Perhaps the optimzer has made a flattening mistake here. It may have rewritten the query to be

SELECT DISTINCT name, id from names

when it meant to rewrite the query as

SELECT DISTINCT name from names

What I'm saying is, it't not clear to me that what's broken is DISTINCT push-down. The bug may really be in the subquery flattening logic.

I agree with your misgivings that your proposed solution is overbroad. I think the bug has to do with the agreement between the outer DISTINCT list and the inner SELECT list. They need to be identical. In this case the outer DISTINCT list has one column but the inner SELECT list has two. I think that the number of tables in the inner FROM list should not be relevant.

Regards,
-Rick



Knut Anders Hatlen wrote:

Hi,

I have found a solution to DERBY-504
(http://issues.apache.org/jira/browse/DERBY-504), but I would like to
get some input, as I'm not sure the solution is quite optimal.

The short version of the bug is that in queries like

 SELECT DISTINCT name FROM (SELECT name, id FROM names) AS n

duplicates are not removed from the result. (I don't think anyone
would/should ever write such a query, but it's still a bug...)

After some investigation I found out that the bug is caused by the
optimizer passing the distinct from the top level query down to the
subquery, thereby scanning the names table for DISTINCT(name, id)
instead of DISTINCT(name).

This particular optimization is meant for cases where the duplicate
elimination could be done during the scanning of the table, and a
comment in org/apache/derby/impl/sql/compile/SelectNode.java mentions
the criteria for using this optimization:

 /* See if we can push duplicate elimination into the store
  * via a hash scan.  This is possible iff:
  *        o  A single table query
  *        o  We haven't merged the order by and distinct sorts.
  *           (Results do not have to be in a particular order.)
  *        o  All entries in the select's RCL are ColumnReferences.
  *        o  No predicates (This is because we currently do not
  *           differentiate between columns referenced in the select
  *           list and columns referenced in other clauses.  In other
  *           words, the store will do duplicate elimination based on
  *           all referenced columns.)

For the single-table-query criterion, the code checks whether the
query has only one source in its FROM clause, and then uses the same
test recursively on that source until the criterion is not met or the
source is a table.

My bug fix works by disabling the recursive check. Instead of running
the test recursively, it just checks whether there is exactly one
source in the FROM clause and that the source is a table, not a
subquery.

Now, this fix works, which is probably most important, but I'm not
sure it is the best solution. At least, I still have some questions.

1. The fix removes optimization in some cases where it could be
   advantageous. It makes sure that the top-level distinct is never
   pushed down to a subquery. Is this an acceptable trade-off? An
   alternative to this would be to make the push more
   intelligent. Right now the distinct scan is pushed to the subquery
   through ResultSetNode.markForDistinctScan(), a method which takes
   no arguments. Perhaps markForDistinctScan() could have the list of
   distinct columns as an argument?

2. The query that exposed this bug is really a distinct select on a
   projection of a projection. A projection of a projection could
   easily be rewritten as a single projection. Should we extend the
   optimizer to handle such cases?

I would really like to hear your comments on these questions!

Thanks!


Reply via email to