On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[email protected]> wrote:
> Common wisdom says that one join is always better than three or six, but
> this actually performs much *worse* (if you use MySQL), because virtually
> any query using GROUP BY uses a temporary table, which kills performance
> because disk I/O is expensive, and temporary tables usually write to disk.
>
> This might not be such a bad solution if you use some other brand of
> database that optimizes GROUP BY better than MySQL. So you should analyze
> queries with EXPLAIN (or equivalent) and profile carefully with sample data
> of realistic size.
>
> Regards,
> Bill Karwin
>
Interesting. I'll defer to you on MySQL since you know it much better
than I (and I don't have anything running on MySQL that I could test
quickly right now anyway). Still, the idea left me somewhat curious,
so I ran a test on one of our SQL Server databases to see how some
different strategies compared.
In the interest of disclosure, the actual tables I used (renamed in
the examples below) are not very large as databases go. The "products"
table has almost 54000 rows, the "tags" table has almost 350, and the
"product_tags" table has about 132000 rows. (The nature of the real
tables that I tested is that each row in the "products" table has at
least one row in the "product_tags" table and only six distinct values
from "tags" have been used so far in "product_tags".)
I compared three different query strategies (shown below). Of these,
METHOD 1 had the lowest overall cost in resources, followed closely by
METHOD 3 (although the total elapsed time for METHOD3 was slightly
faster than that for METHOD 1). I did notice that METHOD 3 grew more
expensive the more tags I included in the search since each tag
requires additional self-joins.
Given that the code for METHOD 1 was the least complex (and therefore
the easiest to define using Zend_Db_Select, just to keep this somewhat
on topic) and had the lowest cost and second lowest overall elapsed
time, I would go that route on SQL Server. As I understand, you're
saying this is not necessarily the case for MySQL.
-- METHOD 1
SELECT P.product_id, P.product_name, P.price
FROM products AS P
INNER JOIN product_tags AS PT
ON P.product_id = PT.product_id
INNER JOIN tags AS T
ON T.tag_id = PT.tag_id
WHERE T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' )
GROUP BY P.product_id, P.product_name, P.price
HAVING COUNT(DISTINCT T.tag_id) = 5
-- METHOD 2
SELECT products.product_id, products.product_name, products.price
FROM products
WHERE EXISTS (
SELECT product_tags.product_id
FROM product_tags
INNER JOIN tags
ON tags.tag_id = product_tags.tag_id
WHERE tags.tag IN ( 'foo', 'bar', 'baz',
'floob', 'widget' )
AND product_tags.product_id = products.product_id
GROUP BY product_tags.product_id
HAVING COUNT(*) = 5
)
-- METHOD 3
SELECT P.product_id, P.product_name, P.price
FROM products AS P
INNER JOIN product_tags AS PT1
ON PT1.product_id = P.product_id
INNER JOIN tags AS T1
ON T1.tag_id = PT1.tag_id
AND T1.tag = 'foo'
INNER JOIN product_tags AS PT2
ON PT2.product_id = P.product_id
INNER JOIN tags AS T2
ON T2.tag_id = PT2.tag_id
AND T2.tag = 'bar'
INNER JOIN product_tags AS PT3
ON PT3.product_id = P.product_id
INNER JOIN tags AS T3
ON T3.tag_id = PT3.tag_id
AND T3.tag = 'baz'
INNER JOIN product_tags AS PT4
ON PT4.product_id = P.product_id
INNER JOIN tags AS T4
ON T4.tag_id = PT4.tag_id
AND T4.tag = 'floob'
INNER JOIN product_tags AS PT5
ON PT5.product_id = P.product_id
INNER JOIN tags AS T5
ON T5.tag_id = PT5.tag_id
AND T5.tag = 'widget'
Results:
METHOD 1: Query cost as calculated by Query Analyzer (relative to batch) 3%
----------------
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
Table 'product_tags'. Scan count 1, logical reads 344, physical reads
0, read-ahead reads 0.
Table 'tags'. Scan count 1, logical reads 5, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
CPU time = 156 ms, elapsed time = 510 ms.
(7220 row(s) affected)
METHOD 2: Query cost as calculated by Query Analyzer (relative to batch) 89%
----------------
Table 'tags'. Scan count 131715, logical reads 263430, physical reads
0, read-ahead reads 0.
Table 'product_tags'. Scan count 53968, logical reads 108421, physical
reads 0, read-ahead reads 0.
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
CPU time = 953 ms, elapsed time = 1021 ms.
(7220 row(s) affected)
METHOD 3: Query cost as calculated by Query Analyzer (relative to batch) 9%
----------------
Table 'tags'. Scan count 5, logical reads 25, physical reads 0,
read-ahead reads 0.
Table 'product_tags'. Scan count 5, logical reads 1720, physical reads
0, read-ahead reads 0.
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
CPU time = 266 ms, elapsed time = 488 ms.
(7220 row(s) affected)
Andrew