Thanks all for the various useful thoughts. Let me backtrack and
state my real underlying issue, this time with actual examples. I hope
the length of this doesn't bore you. It looks to me like an optimizer
issue unless I am missing something.

So, suppose I have a query:

select *
from stuff
inner join 
  ( -- just getting the distinct
    -- user-stuff associations
    -- since there may be multiple;
    -- ultimately I need to query by user
   select stuff_id, user_id 
   from stuff_user
   group by 1,2 -- GROUP BY outperforms DISTINCT
  ) su
  using (stuff_id)
left join
  ( -- this obtains summary statistics
    -- about each stuff item
   select stuff_id, count(*)
   from stuff_events
   group by 1
  ) se
using (stuff_id)
where user_id = 41

This is a very pared down version of what I have. And yes this specific
query can be rewritten as a single GROUP BY, but in the real world I am
gathering the aggregate statistics from several tables, so I actually
have several sub-recordsets similar to the one called "se" above.
Rewriting ALL those as a single GROUP BY is not feasible. I know, all
this cries for a single summarized rollup table, but let's not go there
(for now).

So running the above is inefficient. This particular user_id has only
one associated stuff_id and does not even have much data for that in
stuff_events. The query runs in ~4600ms. Were I to query by stuff_id
instead, things look great (if I change the where clause to the stuff_id
it runs in 25ms). When I query based on stuff_id the optimizer uses an
index on stuff_events.stuff_id. However, when I query by user_id it does
a Seq Scan on stuff_events. I somehow wish I could tell the optimizer to
first figure out which stuff_ids are related to the user_id being asked
for, and then look up ONLY those in the stuff_events table using the
index on stuff_id.
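I can of course force that two-step plan by hand by materializing the
stuff_ids first, e.g. with a temporary table (just a sketch -- "tmp_stuff"
is a throwaway name, and I'd rather not manage temp tables from the
application):

```sql
-- step 1: materialize the stuff_ids for this user
create temp table tmp_stuff as
select distinct stuff_id
from stuff_user
where user_id = 41;

-- step 2: aggregate only over those ids; with a small tmp_stuff
-- the planner can use the index on stuff_events.stuff_id
select *
from stuff
inner join tmp_stuff using (stuff_id)
left join
  (select stuff_id, count(*)
   from stuff_events
   inner join tmp_stuff using (stuff_id)
   group by 1
  ) se
using (stuff_id);
```

But that pushes query planning into my application code, which is what
I'm trying to avoid.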

It would seem (and this is where we get back to my original question)
that one should be able to just say:

select *
from stuff
left join
  (select stuff_id, count(*)
   from stuff_events
   group by 1
  ) se
using (stuff_id)
where stuff_id in 
  (select distinct stuff_id
   from stuff_user
   where user_id = 41
  )

You'd think that the subquery in the IN would be (very quickly) resolved
to a list of stuff_ids and then stuff_events would be accessed via its
stuff_id index. Instead, the Seq Scan on stuff_events still happens and
the query actually is even slower than the original, running in ~5500ms.
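One formulation that does get the per-stuff_id index lookups is a
correlated scalar subquery (again a sketch against the same tables;
"event_count" is just a label I'm adding):

```sql
select s.*,
       -- evaluated once per qualifying row of stuff,
       -- via the index on stuff_events.stuff_id
       (select count(*)
        from stuff_events e
        where e.stuff_id = s.stuff_id) as event_count
from stuff s
where s.stuff_id in
  (select stuff_id
   from stuff_user
   where user_id = 41);
```

With several statistics gathered from several tables, though, this means
one scalar subquery per aggregate, which gets unwieldy fast.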

So one (very ugly) way to optimize the first query is to add an extra
join to stuff_user INSIDE the "se" subquery:

select *
from stuff
inner join 
  (select stuff_id, user_id 
   from stuff_user
   group by 1,2
  ) su
  using (stuff_id)
left join
  (select stuff_id, user_id, count(*)
   from stuff_events
   inner join 
     ( -- same subquery as above
      select stuff_id, user_id 
      from stuff_user
      group by 1,2  
     ) su2
     using (stuff_id)
   group by 1,2
  ) se
using (stuff_id)
where user_id = 41;

This does improve things a lot, bringing the execution time for this
particular user down to 3ms (!), but it is quite ugly, and it still
won't be fast enough for a user_id with many associated stuff_ids.

George

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match
