Hi Andrew,
Just jumping in, here is another approach:
SELECT P.product_id, P.product_name, P.price
FROM
products AS P
INNER JOIN product_tags AS PT
ON P.product_id = PT.product_id
INNER JOIN tags AS T
ON T.tag_id = PT.tag_id AND T.tag IN ( 'foo', 'bar', 'baz',
'floob', 'widget' )
GROUP BY P.product_id
HAVING COUNT(DISTINCT T.tag_id) = 5
Theoretically, this should result in a smaller temporary table, since moving
the IN check to the join condition will limit the records stored in a temporary
table. Also, assuming that product_id is a primary key, you should be save in
grouping on just P.product_id.
Related to this, I recently had a curious issue. While rewriting a subquery
into a join, performance dropped dramatically. My hypothesis was that the
amount of columns MySQL had to store in the temporary table was so large, that
creating a reduced set of columns with a subquery was actually beneficial.
Vincent de Lau
[email protected]
> -----Oorspronkelijk bericht-----
> Van: Andrew Ballard [mailto:[email protected]]
> Verzonden: donderdag 1 juli 2010 17:36
> Aan: Bill Karwin
> CC: [email protected]
> Onderwerp: Re: [fw-general] Question about SQL (not ZF related, sorry)
>
> On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[email protected]> wrote:
> > Common wisdom says that one join is always better than three or six,
> but
> > this actually performs much *worse* (if you use MySQL), because
> virtually
> > any query using GROUP BY uses a temporary table, which kills
> performance
> > because disk I/O is expensive, and temporary tables usually write to
> disk.
> >
> > This might not be such a bad solution if you use some other brand of
> > database that optimizes GROUP BY better than MySQL. So you should
> analyze
> > queries with EXPLAIN (or equivalent) and profile carefully with
> sample data
> > of realistic size.
> >
> > Regards,
> > Bill Karwin
> >
>
> Interesting. I'll defer to you on MySQL since you know it much better
> than I (and I don't have anything running on MySQL that I could test
> quickly right now anyway). Still, the idea left me somewhat curious,
> so I ran a test on one of our SQL Server databases to see how some
> different strategies compared.
>
> In the interest of disclosure, the actual tables I used (renamed in
> the examples below) are not very large as databases go. The "products"
> table has almost 54000 rows, the "tags" table has almost 350, and the
> "product_tags" table has about 132000 rows. (The nature of the real
> tables that I tested is that each row in the "products" table has at
> least one row in the "product_tags" table and only six distinct values
> from "tags" have been used so far in "product_tags".)
>
> I compared three different query strategies (shown below). Of these,
> METHOD 1 had the lowest overall cost in resources, followed closely by
> METHOD 3 (although the total elapsed time for METHOD3 was slightly
> faster than that for METHOD 1). I did notice that METHOD 3 grew more
> expensive the more tags I included in the search since each tag
> requires additional self-joins.
>
> Given that the code for METHOD 1 was the least complex (and therefore
> the easiest to define using Zend_Db_Select, just to keep this somewhat
> on topic) and had the lowest cost and second lowest overall elapsed
> time, I would go that route on SQL Server. As I understand, you're
> saying this is not necessarily the case for MySQL.
>
> -- METHOD 1
> SELECT P.product_id, P.product_name, P.price
> FROM products AS P
> INNER JOIN product_tags AS PT
> ON P.product_id = PT.product_id
> INNER JOIN tags AS T
> ON T.tag_id = PT.tag_id
> WHERE T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' )
> GROUP BY P.product_id, P.product_name, P.price
> HAVING COUNT(DISTINCT T.tag_id) = 5
>
>
> -- METHOD 2
> SELECT products.product_id, products.product_name, products.price
> FROM products
> WHERE EXISTS (
> SELECT product_tags.product_id
> FROM product_tags
> INNER JOIN tags
> ON tags.tag_id = product_tags.tag_id
> WHERE tags.tag IN ( 'foo', 'bar', 'baz',
> 'floob', 'widget' )
> AND product_tags.product_id =
> products.product_id
> GROUP BY product_tags.product_id
> HAVING COUNT(*) = 5
> )
>
>
> -- METHOD 3
> SELECT P.product_id, P.product_name, P.price
> FROM products AS P
> INNER JOIN product_tags AS PT1
> ON PT1.product_id = P.product_id
> INNER JOIN tags AS T1
> ON T1.tag_id = PT1.tag_id
> AND T1.tag = 'foo'
> INNER JOIN product_tags AS PT2
> ON PT2.product_id = P.product_id
> INNER JOIN tags AS T2
> ON T2.tag_id = PT2.tag_id
> AND T2.tag = 'bar'
> INNER JOIN product_tags AS PT3
> ON PT3.product_id = P.product_id
> INNER JOIN tags AS T3
> ON T3.tag_id = PT3.tag_id
> AND T3.tag = 'baz'
> INNER JOIN product_tags AS PT4
> ON PT4.product_id = P.product_id
> INNER JOIN tags AS T4
> ON T4.tag_id = PT4.tag_id
> AND T4.tag = 'floob'
> INNER JOIN product_tags AS PT5
> ON PT5.product_id = P.product_id
> INNER JOIN tags AS T5
> ON T5.tag_id = PT5.tag_id
> AND T5.tag = 'widget'
>
>
>
> Results:
> METHOD 1: Query cost as calculated by Query Analyzer (relative to
> batch) 3%
> ----------------
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> Table 'product_tags'. Scan count 1, logical reads 344, physical reads
> 0, read-ahead reads 0.
> Table 'tags'. Scan count 1, logical reads 5, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
> CPU time = 156 ms, elapsed time = 510 ms.
>
> (7220 row(s) affected)
>
> METHOD 2: Query cost as calculated by Query Analyzer (relative to
> batch) 89%
> ----------------
> Table 'tags'. Scan count 131715, logical reads 263430, physical reads
> 0, read-ahead reads 0.
> Table 'product_tags'. Scan count 53968, logical reads 108421, physical
> reads 0, read-ahead reads 0.
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
> CPU time = 953 ms, elapsed time = 1021 ms.
>
> (7220 row(s) affected)
>
> METHOD 3: Query cost as calculated by Query Analyzer (relative to
> batch) 9%
> ----------------
> Table 'tags'. Scan count 5, logical reads 25, physical reads 0,
> read-ahead reads 0.
> Table 'product_tags'. Scan count 5, logical reads 1720, physical reads
> 0, read-ahead reads 0.
> Table 'products'. Scan count 1, logical reads 294, physical reads 0,
> read-ahead reads 0.
> SQL Server Execution Times:
> CPU time = 266 ms, elapsed time = 488 ms.
>
> (7220 row(s) affected)
>
>
> Andrew