> -----Original Message----- > From: shawn green [mailto:shawn.l.gr...@oracle.com] > Sent: Tuesday, June 11, 2013 2:16 PM > To: mysql@lists.mysql.com > Subject: Re: How do I select all rows of table that have some rows in > another table (AND, not OR) > > Hello Daevid, > > On 6/11/2013 3:59 PM, Daevid Vincent wrote: > > I am trying to implement a filter so that a user could select various > genres > > they want "in" or "out". Perhaps they like 'action' and 'car chases' but > > don't like 'foreign' and 'drifting' (or whatever...) > > > > So I want something sort of like this, however IN() is using an "OR" > > comparison when I need it to be an "AND" > > > > SELECT DISTINCT > > s.`scene_id` AS `id`, > > GROUP_CONCAT(sg.`genre_id`) > > FROM > > `dvds` AS d > > JOIN `scenes_list` AS s > > ON s.`dvd_id` = d.`dvd_id` > > JOIN `scenes_genres` AS sg > > ON sg.`scene_id` = s.`scene_id` > > AND sg.`genre_id` IN (10,38) > > AND sg.`genre_id` NOT IN (22,61) > > GROUP BY s.`scene_id`; > > > > This is giving me way way too many rows returned. > > > > For example, I would expect this scene_id to be in the result set: > > > > SELECT * FROM scenes_genres WHERE scene_id = 17; > > > > scene_id genre_id > > -------- ---------- > > 17 1 > > 17 3 > > 17 10 <-- > > 17 19 > > 17 38 <-- > > 17 53 > > 17 58 > > 17 59 > > > > And this scene ID to NOT be in the result set: > > > > SELECT * FROM scenes_genres WHERE scene_id = 11; > > > > scene_id genre_id > > -------- ---------- > > 11 1 > > 11 10 <-- > > 11 19 > > 11 31 > > 11 32 > > <-- but does not have 38 > > 11 59 > > > > I've tried various subselect ideas, but of course this fails b/c genre_id > > can't be multiple things at one time (AND) > > > > JOIN `scenes_genres` AS sg > > ON sg.`scene_id` = s.`scene_id` > > AND sg.`genre_id` IN ( > > SELECT `genre_id` FROM `scenes_genres` > > WHERE `genre_id` = 10 > > AND `genre_id` = 38 > > AND `genre_id` <> 22 > > AND `genre_id` <> 61 > > ) > > > > And straight up like this failure too... > > > > JOIN `scenes_genres` AS sg > > ON sg.`scene_id` = s.`scene_id` > > AND (sg.`genre_id` = 10 AND sg.`genre_id` = 38) > > AND NOT (sg.`genre_id` = 22 AND sg.`genre_id` = 61) > > > > So I'm sort of out of ideas at this point and hoping someone has a way to > do > > this. > > > > Also, just for S&G this is how we are currently implementing it, but we > feel > > the REGEXP is killing our queries and while "clever" is a bit hacky and > > nullifies any indexes we have on the genres column as it requires a > > file_sort table scan to compare substrings basically... > > > > SELECT * FROM scene_all_genres WHERE scene_id = 17; > > > > scene_id genres > > -------- ------------------------ > > 17 1|3|10|19|38|53|58|59| > > > > SELECT * FROM scene_all_genres WHERE scene_id = 11; > > > > scene_id genres > > -------- ------------------- > > 11 1|10|19|31|32|59| > > > > SELECT DISTINCT > > s.`scene_id` AS `id`, > > sg.`genres` > > FROM > > `scene_all_genres` AS sg, > > `dvds` AS d, > > `scenes_list` AS s > > WHERE dvd_id` = d.`dvd_id` > > AND sg.`scene_id` = s.`scene_id` > > AND sg.`genres` REGEXP '[[:<:]]10[[:>:]].*[[:<:]]38[[:>:]]' > > AND sg.`genres` NOT REGEXP > '(([[:<:]]22[[:>:]])|([[:<:]]61[[:>:]]))' > > > > ; > > > > http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp > > > > [[:<:]], [[:>:]] > > > > These markers stand for word boundaries. They match the beginning and end > of > > words, respectively. A word is a sequence of word characters that is not > > preceded by or followed by word characters. A word character is an > > alphanumeric character in the alnum class or an underscore (_). > > > > To me it looks like you want to know how to match N for N when looking > for subset of properties. It's a multi-step process using plain SQL but > it's fast: > > 1) Build a list of scenes containing the genres you want to see > > CREATE TEMPORARY TABLE tmpList(key(scene_id, genre_id)) > SELECT scene_id, genre_id > FROM scenes_genres > WHERE genre_id` IN (10,38) <-- in this case you have 2 terms to mach > > 2) From that list, determine which of those scenes also contain unwanted > genres and remove them. > > DELETE tmpList > FROM tmpList t > INNER JOIN scenes_genres sg > on sg.scene_id = t.scene_id > and sg.genre_id IN (22,61) > > # at this point, tmpList contains all scenes that have any of the > desired genres but none of the unwanted ones. > > 3) Check to see if any scene has all N matches. > > SELECT scene_id, count(genre_id) as matches > FROM tmpList > GROUP BY scene_id > HAVING matches = 2 > /* ^--- this is the number of terms you are trying to match. */ > > > Of course you can modify the last query to eliminate the HAVING clause > and pick the top 5 matching scenes (even if they are partial matches) > like this > > SELECT scene_id, count(genre_id) as matches > FROM tmpList > GROUP BY scene_id > ORDER BY matches DESC > LIMIT 5 > > Let us know if this is not what you wanted to do. > > > > Note to the rest of the list: > > Yes, we could have done most of this with self-joins or subqueries, an > EXISTS, and a NOT EXISTS inside a single statement but I think the > execution time would have been miserable. I also think that that > approach also would not have allowed us to evaluate a partial match > (like 5 out of 7 target genres), only complete matches would have been > returned. > > I am still very interested in seeing alternative solutions :) > > -- > Shawn Green > MySQL Principal Technical Support Engineer > Oracle USA, Inc. - Hardware and Software, Engineered to Work Together. > Office: Blountville, TN
Shawn, thank you for taking the time to reply. I wasn't expecting the solution to be so much "work" with multiple statements like that. I was thinking it could be done in one (or two, as in split out a portion of it in PHP and re-insert it to the original SQL to avoid a JOIN or something). Part of the issue is that we use PHP to generate the $sql string by appending bits and pieces depending on the search criteria thereby keeping the 'path' through the SQL statement simple and relatively linear. To implement this would require significant re-writing and/or special cases where we could introduce errors or omissions in the future. The frustrating part is that the REGEXP query we use now only takes about 2 seconds on my DEV VM (same database as PROD), however when the RDBMS is loaded it then takes up to 30 seconds so in theory it's not even that inefficient given the # rows. We do use memcached for the results, but since there are so many combinations a user could choose, our hit ratio is not so great and therefore the cache isn't doing us much good and this is why the RDBMS can get loaded up easily. How can an "OR" be so simple using IN() but "AND" be so overly complex? Seems that mysql should have another function for ALL() that works just like IN() to handle this kind of scenario. -- MySQL General Mailing List For list archives: http://lists.mysql.com/mysql To unsubscribe: http://lists.mysql.com/mysql