On Jun 30, 2010, at 12:28 PM, Aurimas Likas wrote:
I have 3 DB tables: products, tags and products_tags. products_tags
has 2 columns product_id and tag_id (for many-to-many relationship).
So the question is: how can I select products which have tag1 AND
tag2 and tag3 and .... tagN? I need a best way in terms of
performance. The only way I can think of now is to create a query
with self joins. The query joins number would equal tags number - 1.
Using N-1 self-joins is the best solution for performance, especially
if you use MySQL. It appears to require a lot of joins, and common
wisdom is that you should avoid joins.
SELECT p.* FROM products p
JOIN products_tags pt1 ON p.product_id = pt1.product_id
JOIN tags t1 ON pt1.tag_id = t1.tag_id AND t1.tag = 'medical'
JOIN products_tags pt2 ON p.product_id = pt2.product_id
JOIN tags t2 ON pt1.tag_id = t2.tag_id AND t2.tag = 'performance'
JOIN products_tags pt3 ON p.product_id = pt3.product_id
JOIN tags t3 ON pt3.tag_id = t3.tag_id AND t3.tag = 'reporting';
If you use natural keys for the tags table instead of a pseudokey,
then your products_tags table contains the value you need, and you an
eliminate half of those joins. For example:
CREATE TABLE tags (
tag VARCHAR(20) PRIMARY KEY -- natural key, not pseudokey
);
CREATE TABLE products_tags (
product_id INT NOT NULL,
tag VARCHAR(20) NOT NULL, -- natural key, not pseudokey
PRIMARY KEY (product_id, tag),
FOREIGN KEY (product_id) REFERENCES products(product_id),
FOREIGN KEY (tag) REFERENCES tags(tag)
);
SELECT p.* FROM products p
JOIN products_tags pt1 ON p.product_id = pt1.product_id AND pt1.tag =
'medical'
JOIN products_tags pt2 ON p.product_id = pt2.product_id AND pt2.tag =
'performance'
JOIN products_tags pt3 ON p.product_id = pt3.product_id AND pt3.tag =
'reporting';
The primary key index in products_tags gives this good performance.
Your EXPLAIN report should say "Using index" which means it can
satisfy the query simply by reading the index, without reading the
data for the products_tags table at all. If your index is cached in
database memory, this runs very fast.
A different solution that some people prefer is to use GROUP BY and
include groups that have as many rows as the number of tags:
SELECT p.*
FROM products p
JOIN products_tags pt ON p.product_id = pt.product_id
WHERE pt.tag IN ('medical', 'performance', 'reporting')
GROUP BY p.product_id -- EXPLAIN reports "Using temporary; Using
filesort"
HAVING COUNT(*) = 3;
Common wisdom says that one join is always better than three or six,
but this actually performs much *worse* (if you use MySQL), because
virtually any query using GROUP BY uses a temporary table, which kills
performance because disk I/O is expensive, and temporary tables
usually write to disk.
This might not be such a bad solution if you use some other brand of
database that optimizes GROUP BY better than MySQL. So you should
analyze queries with EXPLAIN (or equivalent) and profile carefully
with sample data of realistic size.
Regards,
Bill Karwin