Re: [PERFORM] Left Outer Join much faster than non-outer Join?

Simon Riggs Thu, 31 Mar 2005 10:06:14 -0800

On Thu, 2005-03-31 at 00:15 -0800, Ron Mayer wrote:
> Ron Mayer wrote:
> > Tom Lane wrote:
> >> [EMAIL PROTECTED] writes:
> >>> select *
> >>>     from streetname_lookup as sl
> >>>     join city_lookup as cl on (true)
> >>>     left outer join tlid_smaller as ts on (sl.geo_streetname_id = 
> >>>          ts.geo_streetname_id and cl.geo_city_id=ts.geo_city_id)
> >>>     where  str_name='alamo' and  city='san antonio' and state='TX'
> >>> ;
> >> That's a fairly odd query; 
> > 
> > 
> > I think it's a very common type of query in data warehousing.
> > 
> > It's reasonably typical of a traditional star schema where
> > "streetname_lookup" and "city_lookup" are dimension tables
> > and "tlid_smaller" is the central fact table.
>


Yes, agreed.

> Although looking again I must admit the query was
> written unconventionally.  Perhaps those queries are
> remnants dating back to a version when you could
> force join orders this way?
> 
> Perhaps a more common way of writing it would have been:
> 
>    select * from tlid_smaller
>     where geo_streetname_id in (select geo_streetname_id from 
> streetname_lookup where str_name='$str_name')
>       and geo_city_id       in (select geo_city_id from city_lookup where 
> city='$city' and state='$state');
> 
> However this query also fails to use the multi-column
> index on (geo_streetname_id,geo_city_id).  Explain
> analyze shown below.

...which is my understanding too.

> In cases where I can be sure only one result will come
> from each of the lookup queries I guess I can do this:
> 
>    select * from tlid_smaller
>     where geo_streetname_id = (select geo_streetname_id from 
> streetname_lookup where str_name='$str_name')
>       and geo_city_id       = (select geo_city_id from city_lookup where 
> city='$city' and state='$state');
> 
> which has the nicest plan of them all (explain analyze
> also shown below).

Which is not the case for the generalised star join.

The general case query here is:
        SELECT (whatever)
        FROM FACT, DIMENSION1 D1, DIMENSION2 D2, DIMENSION3 D3etc..
        WHERE
                FACT.dimension1_pk = D1.dimension1_pk
        AND     FACT.dimension2_pk = D2.dimension2_pk
        AND     FACT.dimension3_pk = D3.dimension3_pk
        AND     D1.dimdescription = 'X'
        AND     D2.dimdescription = 'Y'
        AND     D3.dimdescription = 'Z'
        ...
with FACT PK=(dimension1_pk, dimension2_pk, dimension3_pk)

with a more specific example of
        SELECT sum(item_price)
        FROM Sales, Store, Item, TTime
        WHERE
                Sales.store_pk = Store.store_pk
        AND     Store.region = 'UK'
        AND     Sales.item_pk = Item.item_pk
        AND     Item.category = 'Cameras'
        AND     Sales.time_pk = TTime.time_pk
        AND     TTime.month = 3
        AND     TTime.year = 2005

A very good plan for solving this, under specific conditions is...
        CartesianProduct(Store, Item, TTime) -> Sales.PK

which accesses the largest table only once.

As Tom says, the current optimizer won't go near that plan, for good
reason, without specifically tweaking collapse limits. I know full well
that any changes in that direction will need to be strong because that
execution plan is very sensitive to even minor changes in data
distribution.

The plan requires some fairly extensive checking to be put into place.
The selectivity of requests against the smaller tables needs to be very
well known, so that the upper bound estimate of cardinality of the
cartesian product is feasible AND still low enough to use the index on
Sales.

This is probably going to need information to be captured on multi-
column index selectivity, to ensure that last part.

It is likely that the statistics targets on the dimension tables would
need to be higher enough to identify MFVs or at least reduce the upper
bound of selectivity. It is also requires the table sizes to be
examined, to ensure this type of plan is considered pointlessly.
Some other systems that support this join type, turn off checking for it
by default. We could do the same with enable_starjoin = off.

Anyway, seems like a fair amount of work there... yes?

Best Regards, Simon Riggs



---------------------------(end of broadcast)---------------------------
TIP 9: the planner will ignore your desire to choose an index scan if your
      joining column's datatypes do not match

Re: [PERFORM] Left Outer Join much faster than non-outer Join?

Reply via email to