Keith, this definitely explains the observed time: it is proportional to
count(a) * count(ab)**2, and thus non-linear.
And a correlated sub-query is generally recalculated for each row.
But I do not agree with everything.
In my example it is correlated to the outermost query, not to the
sub-query in which it occurs.
Theoretically the optimizer can take this into account and only
recalculate once for each row of the outermost query. And if I'm not
mistaken, PostgreSQL does so. Below is a version modified for pgsql
that runs fast no matter the number of rows.
Thanks for the suggested change, where the minimum size is computed in
a sub-query (not a sub-sub-query) and joined to the other sub-query.
This is so elegant. I still need to compare the timing to David's
version and use the fastest.
/* sudo -u postgres psql < issue2p.sql */
drop table if exists a
;
drop table if exists ab
;
create table a (a int, primary key (a))
;
create table ab (a int, b int, size int, primary key (a,b))
;
insert into a
with recursive i as (select 1 as i union all select i+1 from i where i<3)
select i from i
;
insert into ab
with recursive i as (select 1 as i union all select i+1 from i where i<10000)
select a, i as b, (a+i)%10 as size from a, i
;
select a,
(
select count(*)
from ab
where a=a.a
and size=(select min(size) from ab where a=a.a)
)
from a
;
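For anyone who wants a self-contained reproduction outside of psql, here is a sketch of the same schema and the original nested correlated query run through Python's sqlite3 module. The row count per group is shrunk from 10000 to 200 (a change I made so the quadratic query finishes instantly); with size = (a+i)%10 cycling evenly over 200 rows, each group's minimum size is 0 and occurs 20 times.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table a (a int primary key);
create table ab (a int, b int, size int, primary key (a, b));
insert into a
  with recursive i(i) as (select 1 union all select i+1 from i where i < 3)
  select i from i;
insert into ab
  with recursive i(i) as (select 1 union all select i+1 from i where i < 200)
  select a, i, (a + i) % 10 from a, i;
""")
# The original doubly-nested correlated query: the inner min(size)
# sub-query is correlated to the outermost table a.
rows = con.execute("""
  select a.a,
         (select count(*)
            from ab
           where a = a.a
             and size = (select min(size) from ab where a = a.a))
  from a
  order by a.a
""").fetchall()
print(rows)  # each of the 3 groups counts 20 rows at its minimum size
```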
Keith Medcalf wrote:
Well of course. You are aware that a correlated subquery means "for
each candidate result execute the query"?
So as you have formulated the query it means:
for each row in a
compute the result count which
for each ab candidate row
calculate whether it is the minimum
which means that you have requested that the same result be
computed many times over. You have requested examination of
count(a) * count(ab) * count(ab) rows.
Instead you should be computing the min(size) for each group of a
once, and using that value in the correlated subquery
select a.a,
(
select count(*)
from ab
where a == a.a
and size == b.size
) as acount
from a,
(
select a,
min(size) as size
from ab
group by a
) as b
where a.a == b.a;
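As a quick sanity check that the rewrite returns the same counts, here is a sketch running it in SQLite via Python's sqlite3 on the same schema, with the per-group row count shrunk from 10000 to 200 (my change, so it runs instantly; each group's minimum size 0 then occurs 20 times):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table a (a int primary key);
create table ab (a int, b int, size int, primary key (a, b));
insert into a
  with recursive i(i) as (select 1 union all select i+1 from i where i < 3)
  select i from i;
insert into ab
  with recursive i(i) as (select 1 union all select i+1 from i where i < 200)
  select a, i, (a + i) % 10 from a, i;
""")
# Keith's rewrite: min(size) is computed once per group in the derived
# table b, then reused inside the correlated count.
rows = con.execute("""
  select a.a,
         (select count(*)
            from ab
           where a == a.a
             and size == b.size) as acount
  from a,
       (select a, min(size) as size from ab group by a) as b
  where a.a == b.a
  order by a.a
""").fetchall()
print(rows)  # same result as the nested version: 20 per group
```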
This will result in scanning count(ab) + count(a) * count(ab) rows.
Which is significantly less. On my computer it reduces the
execution time of the original query you posited from 400 ticks to
less than 1 tick (i.e., from 500 ms to <8 ms)
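Plugging in the table sizes from the script earlier in the thread (count(a) = 3, count(ab) = 3 * 10000 = 30000), the two formulas work out as a back-of-the-envelope like this:

```python
# Rows examined under each plan, for the example's table sizes.
na = 3            # rows in a
nab = 3 * 10000   # rows in ab (10000 b values per a value)

original = na * nab * nab   # nested correlated sub-queries
rewritten = nab + na * nab  # one grouped scan, then one correlated scan

print(original, rewritten, original // rewritten)
# 2.7 billion rows versus 120 thousand, a factor of 22500
```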
I do not know if any optimizer can flatten your original query to any
significant degree. Some optimizers may arrive at my fixed-up query
because they are capable of doing a hash table lookup on the result
of the innermost correlated sub-query. SQLite does not do that, and
without that capability I do not think there is a relational database
query optimizer on the planet that can help you.
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users