On Wed, Jun 30, 2010 at 4:46 PM, Bill Karwin <[email protected]> wrote:
> Common wisdom says that one join is always better than three or six, but
> this actually performs much *worse* (if you use MySQL), because virtually
> any query using GROUP BY uses a temporary table, which kills performance
> because disk I/O is expensive, and temporary tables usually write to disk.
>
> This might not be such a bad solution if you use some other brand of
> database that optimizes GROUP BY better than MySQL.  So you should analyze
> queries with EXPLAIN (or equivalent) and profile carefully with sample data
> of realistic size.
>
> Regards,
> Bill Karwin
>

Interesting. I'll defer to you on MySQL since you know it much better
than I (and I don't have anything running on MySQL that I could test
quickly right now anyway). Still, the idea left me somewhat curious,
so I ran a test on one of our SQL Server databases to see how some
different strategies compared.

In the interest of disclosure, the actual tables I used (renamed in
the examples below) are not very large as databases go. The "products"
table has almost 54000 rows, the "tags" table has almost 350, and the
"product_tags" table has about 132000 rows. (The nature of the real
tables that I tested is that each row in the "products" table has at
least one row in the "product_tags" table and only six distinct values
from "tags" have been used so far in "product_tags".)

I compared three different query strategies (shown below). Of these,
METHOD 1 had the lowest overall cost in resources, followed closely by
METHOD 3 (although the total elapsed time for METHOD3 was slightly
faster than that for METHOD 1). I did notice that METHOD 3 grew more
expensive the more tags I included in the search since each tag
requires additional self-joins.

Given that the code for METHOD 1 was the least complex (and therefore
the easiest to define using Zend_Db_Select, just to keep this somewhat
on topic) and had the lowest cost and second lowest overall elapsed
time, I would go that route on SQL Server. As I understand, you're
saying this is not necessarily the case for MySQL.

-- METHOD 1
SELECT      P.product_id, P.product_name, P.price
FROM        products AS P
INNER JOIN  product_tags AS PT
                ON  P.product_id = PT.product_id
INNER JOIN  tags AS T
                ON  T.tag_id = PT.tag_id
WHERE       T.tag IN ( 'foo', 'bar', 'baz', 'floob', 'widget' )
GROUP BY    P.product_id, P.product_name, P.price
HAVING      COUNT(DISTINCT T.tag_id) = 5


-- METHOD 2
SELECT      products.product_id, products.product_name, products.price
FROM        products
WHERE       EXISTS (
                SELECT      product_tags.product_id
                FROM        product_tags
                INNER JOIN  tags
                                ON  tags.tag_id = product_tags.tag_id
                WHERE       tags.tag IN ( 'foo', 'bar', 'baz',
'floob', 'widget' )
                  AND       product_tags.product_id = products.product_id
                GROUP BY    product_tags.product_id
                HAVING      COUNT(*) = 5
            )


-- METHOD 3
SELECT      P.product_id, P.product_name, P.price
FROM        products AS P
INNER JOIN  product_tags AS PT1
                ON  PT1.product_id = P.product_id
INNER JOIN  tags AS T1
                ON  T1.tag_id = PT1.tag_id
                AND T1.tag = 'foo'
INNER JOIN  product_tags AS PT2
                ON  PT2.product_id = P.product_id
INNER JOIN  tags AS T2
                ON  T2.tag_id = PT2.tag_id
                AND T2.tag = 'bar'
INNER JOIN  product_tags AS PT3
                ON  PT3.product_id = P.product_id
INNER JOIN  tags AS T3
                ON  T3.tag_id = PT3.tag_id
                AND T3.tag = 'baz'
INNER JOIN  product_tags AS PT4
                ON  PT4.product_id = P.product_id
INNER JOIN  tags AS T4
                ON  T4.tag_id = PT4.tag_id
                AND T4.tag = 'floob'
INNER JOIN  product_tags AS PT5
                ON  PT5.product_id = P.product_id
INNER JOIN  tags AS T5
                ON  T5.tag_id = PT5.tag_id
                AND T5.tag = 'widget'



Results:
METHOD 1: Query cost as calculated by Query Analyzer (relative to batch) 3%
----------------
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
Table 'product_tags'. Scan count 1, logical reads 344, physical reads
0, read-ahead reads 0.
Table 'tags'. Scan count 1, logical reads 5, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
   CPU time = 156 ms,  elapsed time = 510 ms.

(7220 row(s) affected)

METHOD 2: Query cost as calculated by Query Analyzer (relative to batch) 89%
----------------
Table 'tags'. Scan count 131715, logical reads 263430, physical reads
0, read-ahead reads 0.
Table 'product_tags'. Scan count 53968, logical reads 108421, physical
reads 0, read-ahead reads 0.
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
   CPU time = 953 ms,  elapsed time = 1021 ms.

(7220 row(s) affected)

METHOD 3: Query cost as calculated by Query Analyzer (relative to batch) 9%
----------------
Table 'tags'. Scan count 5, logical reads 25, physical reads 0,
read-ahead reads 0.
Table 'product_tags'. Scan count 5, logical reads 1720, physical reads
0, read-ahead reads 0.
Table 'products'. Scan count 1, logical reads 294, physical reads 0,
read-ahead reads 0.
SQL Server Execution Times:
   CPU time = 266 ms,  elapsed time = 488 ms.

(7220 row(s) affected)


Andrew

Reply via email to