On Jun 30, 2010, at 12:28 PM, Aurimas Likas wrote:

I have 3 DB tables: products, tags and products_tags. products_tags has 2 columns product_id and tag_id (for many-to-many relationship). So the question is: how can I select products which have tag1 AND tag2 and tag3 and .... tagN? I need a best way in terms of performance. The only way I can think of now is to create a query with self joins. The query joins number would equal tags number - 1.

Using N-1 self-joins is the best solution for performance, especially if you use MySQL. It appears to require a lot of joins, and common wisdom is that you should avoid joins.

SELECT p.* FROM products p
JOIN products_tags pt1 ON p.product_id = pt1.product_id
JOIN tags t1 ON pt1.tag_id = t1.tag_id AND t1.tag = 'medical'
JOIN products_tags pt2 ON p.product_id = pt2.product_id
JOIN tags t2 ON pt1.tag_id = t2.tag_id AND t2.tag = 'performance'
JOIN products_tags pt3 ON p.product_id = pt3.product_id
JOIN tags t3 ON pt3.tag_id = t3.tag_id AND t3.tag = 'reporting';

If you use natural keys for the tags table instead of a pseudokey, then your products_tags table contains the value you need, and you an eliminate half of those joins. For example:

CREATE TABLE tags (
  tag VARCHAR(20) PRIMARY KEY   -- natural key, not pseudokey
);

CREATE TABLE products_tags (
  product_id INT NOT NULL,
  tag VARCHAR(20) NOT NULL,  -- natural key, not pseudokey
  PRIMARY KEY (product_id, tag),
  FOREIGN KEY (product_id) REFERENCES products(product_id),
  FOREIGN KEY (tag) REFERENCES tags(tag)
);

SELECT p.* FROM products p
JOIN products_tags pt1 ON p.product_id = pt1.product_id AND pt1.tag = 'medical' JOIN products_tags pt2 ON p.product_id = pt2.product_id AND pt2.tag = 'performance' JOIN products_tags pt3 ON p.product_id = pt3.product_id AND pt3.tag = 'reporting';

The primary key index in products_tags gives this good performance. Your EXPLAIN report should say "Using index" which means it can satisfy the query simply by reading the index, without reading the data for the products_tags table at all. If your index is cached in database memory, this runs very fast.

A different solution that some people prefer is to use GROUP BY and include groups that have as many rows as the number of tags:

SELECT p.*
FROM products p
JOIN products_tags pt ON p.product_id = pt.product_id
WHERE pt.tag IN ('medical', 'performance', 'reporting')
GROUP BY p.product_id -- EXPLAIN reports "Using temporary; Using filesort"
HAVING COUNT(*) = 3;

Common wisdom says that one join is always better than three or six, but this actually performs much *worse* (if you use MySQL), because virtually any query using GROUP BY uses a temporary table, which kills performance because disk I/O is expensive, and temporary tables usually write to disk.

This might not be such a bad solution if you use some other brand of database that optimizes GROUP BY better than MySQL. So you should analyze queries with EXPLAIN (or equivalent) and profile carefully with sample data of realistic size.

Regards,
Bill Karwin

Reply via email to