Hi Andrew,

Just jumping in, here is another approach:

SELECT P.product_id, P.product_name, P.price
FROM  
      products AS P
        INNER JOIN  product_tags AS PT
                ON  P.product_id = PT.product_id
        INNER JOIN  tags AS T
                ON  T.tag_id = PT.tag_id AND T.tag IN ( 'foo', 'bar', 'baz', 
'floob', 'widget' )
GROUP BY    P.product_id
HAVING      COUNT(DISTINCT T.tag_id) = 5

Theoretically, this should result in a smaller temporary table, since moving 
the IN check to the join condition will limit the records stored in a temporary 
table. Also, assuming that product_id is a primary key, you should be save in 
grouping on just P.product_id.

Related to this, I recently had a curious issue. While rewriting a subquery 
into a join, performance dropped dramatically. My hypothesis was that the 
amount of columns MySQL had to store in the temporary table was so large, that 
creating a reduced set of columns with a subquery was actually beneficial.

Vincent de Lau
 [email protected]



> -----Oorspronkelijk bericht-----
> Van: Andrew Ballard [mailto:[email protected]]
> Verzonden: donderdag 1 juli 2010 17:36
> Aan: Bill Karwin
> CC: [email protected]
> Onderwerp: Re: [fw-general] Question about SQL (not ZF related, sorry)
> 
> On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[email protected]> wrote:
> > Common wisdom says that one join is always better than three or six,
> but
> > this actually performs much *worse* (if you use MySQL), because
> virtually
> > any query using GROUP BY uses a temporary table, which kills
> performance
> > because disk I/O is expensive, and temporary tables usually write to
> disk.
> >
> > This might not be such a bad solution if you use some other brand of
> > database that optimizes GROUP BY better than MySQL.  So you should
> analyze
> > queries with EXPLAIN (or equivalent) and profile carefully with
> sample data
> > of realistic size.
> >
> > Regards,
> > Bill Karwin
> >
> 
> Interesting. I'll defer to you on MySQL since you know it much better
> than I (and I don't have anything running on MySQL that I could test
> quickly right now anyway). Still, the idea left me somewhat curious,
> so I ran a test on one of our SQL Server databases to see how some
> different strategies compared.
> 
> In the interest of disclosure, the actual tables I used (renamed in
> the examples below) are not very large as databases go. The "products"
> table has almost 54000 rows, the "tags" table has almost 350, and the
> "product_tags" table has about 132000 rows. (The nature of the real
> tables that I tested is that each row in the "products" table has at
> least one row in the "product_tags" table and only six distinct values
> from "tags" have been used so far in "product_tags".)
> 
> I compared three different query strategies (shown below). Of these,
> METHOD 1 had the lowest overall cost in resources, followed closely by
> METHOD 3 (although the total elapsed time for METHOD3 was slightly
> faster than that for METHOD 1). I did notice that METHOD 3 grew more
> expensive the more tags I included in the search since each tag
> requires additional self-joins.
> 
> Given that the code for METHOD 1 was the least complex (and therefore
> the easiest to define using Zend_Db_Select, just to keep this somewhat
> on topic) and had the lowest cost and second lowest overall elapsed
> time, I would go that route on SQL Server. As I understand, you're
> saying this is not necessarily the case for MySQL.
> 
> -- METHOD 1
> SELECT      P.product_id, P.product_name, P.price
> FROM        products AS P
> INNER JOIN  product_tags AS PT
>                 ON  P.product_id = PT.product_id
> INNER JOIN  tags AS T
>                 ON  T.tag_id = PT.tag_id
> WHERE       T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' )
> GROUP BY    P.product_id, P.product_name, P.price
> HAVING      COUNT(DISTINCT T.tag_id) = 5
> 
> 
> -- METHOD 2
> SELECT      products.product_id, products.product_name, products.price
> FROM        products
> WHERE       EXISTS (
>                 SELECT      product_tags.product_id
>                 FROM        product_tags
>                 INNER JOIN  tags
>                                 ON  tags.tag_id = product_tags.tag_id
>                 WHERE       tags.tag IN ( 'foo', 'bar', 'baz',
> 'floob', 'widget' )
>                   AND       product_tags.product_id =
> products.product_id
>                 GROUP BY    product_tags.product_id
>                 HAVING      COUNT(*) = 5
>             )
> 
> 
> -- METHOD 3
> SELECT      P.product_id, P.product_name, P.price
> FROM        products AS P
> INNER JOIN  product_tags AS PT1
>                 ON  PT1.product_id = P.product_id
> INNER JOIN  tags AS T1
>                 ON  T1.tag_id = PT1.tag_id
>                 AND T1.tag = 'foo'
> INNER JOIN  product_tags AS PT2
>                 ON  PT2.product_id = P.product_id
> INNER JOIN  tags AS T2
>                 ON  T2.tag_id = PT2.tag_id
>                 AND T2.tag = 'bar'
> INNER JOIN  product_tags AS PT3
>                 ON  PT3.product_id = P.product_id
> INNER JOIN  tags AS T3
>                 ON  T3.tag_id = PT3.tag_id
>                 AND T3.tag = 'baz'
> INNER JOIN  product_tags AS PT4
>                 ON  PT4.product_id = P.product_id
> INNER JOIN  tags AS T4
>                 ON  T4.tag_id = PT4.tag_id
>                 AND T4.tag = 'floob'
> INNER JOIN  product_tags AS PT5
>                 ON  PT5.product_id = P.product_id
> INNER JOIN  tags AS T5
>                 ON  T5.tag_id = PT5.tag_id
>                 AND T5.tag = 'widget'
> 
> 
> 
> Results:
> METHOD 1: Query cost as calculated by Query Analyzer (relative to
> batch) 3%
> ----------------
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> Table 'product_tags'. Scan count 1, logical reads 344, physical reads
> 0, read-ahead reads 0.
> Table 'tags'. Scan count 1, logical reads 5, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
>    CPU time = 156 ms,  elapsed time = 510 ms.
> 
> (7220 row(s) affected)
> 
> METHOD 2: Query cost as calculated by Query Analyzer (relative to
> batch) 89%
> ----------------
> Table 'tags'. Scan count 131715, logical reads 263430, physical reads
> 0, read-ahead reads 0.
> Table 'product_tags'. Scan count 53968, logical reads 108421, physical
> reads 0, read-ahead reads 0.
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
>    CPU time = 953 ms,  elapsed time = 1021 ms.
> 
> (7220 row(s) affected)
> 
> METHOD 3: Query cost as calculated by Query Analyzer (relative to
> batch) 9%
> ----------------
> Table 'tags'. Scan count 5, logical reads 25, physical reads 0,
> read-ahead reads 0.
> Table 'product_tags'. Scan count 5, logical reads 1720, physical reads
> 0, read-ahead reads 0.
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
>    CPU time = 266 ms,  elapsed time = 488 ms.
> 
> (7220 row(s) affected)
> 
> 
> Andrew

Reply via email to