Thinking out of the box... (And posting my reply at the 'wrong' end of the 
email.)...

Are there fewer than 64 genres?  Use a SET or BIGINT UNSIGNED.

> > >                  AND sg.`genre_id` IN (10,38)
> > >                  AND sg.`genre_id` NOT IN (22,61)

-->
    AND  genre & ((1<<10) | (1<<38)) != 0
    AND  genre & ((1<<22) | (1<<61))  = 0

This would avoid having that extra table, and save a lot of space.

If you have more than 64 genres, then 
Plan A: clump them into some kind of application grouping and use multiple 
INTs/SETs.
Plan B: do mod & div arithmetic to compute which genre field to tackle.

For B, something like:
    AND  (genre1 & (1<<0)) + (genre3 & (1<<8)) != 0
    AND  (genre2 & (1<<2)) + (genre6 & (1<<1))  = 0
(That's assuming 10 bits per genre# field.  I would use 32 and INT UNSIGNED.)



> -----Original Message-----
> From: Daevid Vincent [mailto:dae...@daevid.com]
> Sent: Tuesday, June 11, 2013 4:17 PM
> To: mysql@lists.mysql.com
> Cc: 'shawn green'
> Subject: RE: How do I select all rows of table that have some rows in
> another table (AND, not OR)
> 
> 
> 
> > -----Original Message-----
> > From: shawn green [mailto:shawn.l.gr...@oracle.com]
> > Sent: Tuesday, June 11, 2013 2:16 PM
> > To: mysql@lists.mysql.com
> > Subject: Re: How do I select all rows of table that have some rows in
> > another table (AND, not OR)
> >
> > Hello Daevid,
> >
> > On 6/11/2013 3:59 PM, Daevid Vincent wrote:
> > > I am trying to implement a filter so that a user could select
> > > various
> > genres
> > > they want "in" or "out". Perhaps they like 'action' and 'car chases'
> > > but don't like 'foreign' and 'drifting' (or whatever...)
> > >
> > > So I want something sort of like this, however IN() is using an "OR"
> > > comparison when I need it to be an "AND"
> > >
> > > SELECT DISTINCT
> > >          s.`scene_id` AS `id`,
> > >          GROUP_CONCAT(sg.`genre_id`) FROM
> > >          `dvds` AS d
> > >          JOIN `scenes_list` AS s
> > >                  ON s.`dvd_id` = d.`dvd_id`
> > >          JOIN `scenes_genres` AS sg
> > >                  ON sg.`scene_id` = s.`scene_id`
> > >                  AND sg.`genre_id` IN (10,38)
> > >                  AND sg.`genre_id` NOT IN (22,61) GROUP BY
> > > s.`scene_id`;
> > >
> > > This is giving me way way too many rows returned.
> > >
> > > For example, I would expect this scene_id to be in the result set:
> > >
> > > SELECT * FROM scenes_genres WHERE scene_id = 17;
> > >
> > > scene_id  genre_id
> > > --------  ----------
> > >        17           1
> > >        17           3
> > >        17          10 <--
> > >        17          19
> > >        17          38 <--
> > >        17          53
> > >        17          58
> > >        17          59
> > >
> > > And this scene ID to NOT be in the result set:
> > >
> > > SELECT * FROM scenes_genres WHERE scene_id = 11;
> > >
> > > scene_id  genre_id
> > > --------  ----------
> > >        11           1
> > >        11          10 <--
> > >        11          19
> > >        11          31
> > >        11          32
> > >                  <-- but does not have 38
> > >        11          59
> > >
> > > I've tried various subselect ideas, but of course this fails b/c
> genre_id
> > > can't be multiple things at one time (AND)
> > >
> > >     JOIN `scenes_genres` AS sg
> > >       ON sg.`scene_id` = s.`scene_id`
> > >          AND sg.`genre_id` IN (
> > >            SELECT `genre_id` FROM `scenes_genres`
> > >             WHERE `genre_id` = 10
> > >                    AND `genre_id` = 38
> > >                    AND `genre_id` <> 22
> > >                    AND `genre_id` <> 61
> > >         )
> > >
> > > And straight up like this failure too...
> > >
> > >    JOIN `scenes_genres` AS sg
> > >      ON sg.`scene_id` = s.`scene_id`
> > >         AND (sg.`genre_id` = 10 AND sg.`genre_id` = 38)
> > >         AND NOT (sg.`genre_id` = 22 AND sg.`genre_id` = 61)
> > >
> > > So I'm sort of out of ideas at this point and hoping someone has a
> > > way
> to
> > do
> > > this.
> > >
> > > Also, just for S&G this is how we are currently implementing it, but
> > > we
> > feel
> > > the REGEXP is killing our queries and while "clever" is a bit hacky
> > > and nullifies any indexes we have on the genres column as it
> > > requires a file_sort table scan to compare substrings basically...
> > >
> > > SELECT * FROM scene_all_genres WHERE scene_id = 17;
> > >
> > > scene_id  genres
> > > --------  ------------------------
> > >        17  1|3|10|19|38|53|58|59|
> > >
> > > SELECT * FROM scene_all_genres WHERE scene_id = 11;
> > >
> > > scene_id  genres
> > > --------  -------------------
> > >        11  1|10|19|31|32|59|
> > >
> > > SELECT DISTINCT
> > >          s.`scene_id` AS `id`,
> > >          sg.`genres`
> > > FROM
> > >          `scene_all_genres` AS sg,
> > >          `dvds` AS d,
> > >          `scenes_list` AS s
> > > WHERE      dvd_id` = d.`dvd_id`
> > >          AND sg.`scene_id` = s.`scene_id`
> > >          AND sg.`genres` REGEXP '[[:<:]]10[[:>:]].*[[:<:]]38[[:>:]]'
> > >          AND sg.`genres` NOT REGEXP
> > '(([[:<:]]22[[:>:]])|([[:<:]]61[[:>:]]))'
> > >
> > > ;
> > >
> > > http://dev.mysql.com/doc/refman/5.0/en/regexp.html#operator_regexp
> > >
> > > [[:<:]], [[:>:]]
> > >
> > > These markers stand for word boundaries. They match the beginning
> > > and
> end
> > of
> > > words, respectively. A word is a sequence of word characters that is
> > > not preceded by or followed by word characters. A word character is
> > > an alphanumeric character in the alnum class or an underscore (_).
> > >
> >
> > To me it looks like you want to know how to match N for N when looking
> > for subset of properties. It's a multi-step process using plain SQL
> > but it's fast:
> >
> > 1) Build a list of scenes containing the genres you want to see
> >
> > CREATE TEMPORARY TABLE tmpList(key(scene_id, genre_id)) SELECT
> > scene_id, genre_id FROM scenes_genres WHERE genre_id` IN (10,38)  <--
> > in this case you have 2 terms to mach
> >
> > 2) From that list, determine which of those scenes also contain
> > unwanted genres and remove them.
> >
> > DELETE tmpList
> > FROM tmpList t
> > INNER JOIN scenes_genres sg
> >    on sg.scene_id = t.scene_id
> >    and sg.genre_id IN (22,61)
> >
> > # at this point, tmpList contains all scenes that have any of the
> > desired genres but none of the unwanted ones.
> >
> > 3) Check to see if any scene has all N matches.
> >
> > SELECT scene_id, count(genre_id) as matches FROM tmpList GROUP BY
> > scene_id HAVING matches = 2
> > /* ^--- this is the number of terms you are trying to match. */
> >
> >
> > Of course you can modify the last query to eliminate the HAVING clause
> > and pick the top 5 matching scenes (even if they are partial matches)
> > like this
> >
> > SELECT scene_id, count(genre_id) as matches
> > FROM tmpList
> > GROUP BY scene_id
> > ORDER BY matches DESC
> > LIMIT 5
> >
> > Let us know if this is not what you wanted to do.
> >
> >
> >
> > Note to the rest of the list:
> >
> > Yes, we could have done most of this with self-joins or subqueries, an
> > EXISTS, and a NOT EXISTS inside a single statement but I think the
> > execution time would have been miserable. I also think that that
> > approach also would not have allowed us to evaluate a partial match
> > (like 5 out of 7 target genres), only complete matches would have been
> > returned.
> >
> > I am still very interested in seeing alternative solutions :)
> >
> > --
> > Shawn Green
> > MySQL Principal Technical Support Engineer
> > Oracle USA, Inc. - Hardware and Software, Engineered to Work Together.
> > Office: Blountville, TN
> 
> Shawn, thank you for taking the time to reply. I wasn't expecting the
> solution to be so much "work" with multiple statements like that. I was
> thinking it could be done in one (or two, as in split out a portion of it
> in
> PHP and re-insert it to the original SQL to avoid a JOIN or something).
> Part
> of the issue is that we use PHP to generate the $sql string by appending
> bits and pieces depending on the search criteria thereby keeping the
> 'path'
> through the SQL statement simple and relatively linear. To implement this
> would require significant re-writing and/or special cases where we could
> introduce errors or omissions in the future. The frustrating part is that
> the REGEXP query we use now only takes about 2 seconds on my DEV VM (same
> database as PROD), however when the RDBMS is loaded it then takes up to 30
> seconds so in theory it's not even that inefficient given the # rows. We
> do
> use memcached for the results, but since there are so many combinations a
> user could choose, our hit ratio is not so great and therefore the cache
> isn't doing us much good and this is why the RDBMS can get loaded up
> easily.
> 
> How can an "OR" be so simple using IN() but "AND" be so overly complex?
> Seems that mysql should have another function for ALL() that works just
> like
> IN() to handle this kind of scenario.
> 
> 
> 
> 
> --
> MySQL General Mailing List
> For list archives: http://lists.mysql.com/mysql
> To unsubscribe:    http://lists.mysql.com/mysql


--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/mysql

Reply via email to