Re: select where multiple joined records match

Michael Stassen Mon, 21 Feb 2005 11:29:09 -0800

AM Thomas wrote:

Hi there,
I'll answer your questions below, but I should tell you that it looks like even three or four "joins" seem to make this too slow. Theoretically, the user could select up to 15 or so GoalNumber values, 12 for Grade, and about 20 possibilities for Subject - clearly it wouldn't be useful to specify that many items in a search, but that's the theoretical maximum; 4 or 5 values isn't unreasonable, though.

Four "AND"ed goal numbers plus a subject and a grade slowed the search (on the shared commercial web host I'm using) into the 3 minute range, and that's with a regular join, not a left join. This is the SELECT that took about 3 minutes (3 trials, simplifying slightly each time, simplest given here):

You've removed necessary conditions on the JOINs, so you are getting lots of extra rows. Furthermore, the logic isn't right, so I think this query will, in all likelihood, retrieve incorrect rows. I'll explain. For a given resource id,

select r.id from resources as r
  join resource_goals as g0 on (r.id=g0.ResourceID)

(adding the WHERE clause below), this retrieves every row in copy 0 with the right subject and grade, *regardless of goal*

join resource_goals as g1 on (r.id=g1.ResourceID and g1.GoalNumber=1)

and pairs it with every row in copy 1 with GoalNumber=1, *regardless of Subject or Grade*

join resource_goals as g2 on (r.id=g2.ResourceID and g2.GoalNumber=2)

and pairs it with every row in copy 2 with GoalNumber=2, *regardless of Subject or Grade*

join resource_goals as g3 on (r.id=g3.ResourceID and g3.GoalNumber=3)

and pairs it with every row in copy 3 with GoalNumber=3, *regardless of Subject or Grade*

join resource_goals as g4 on (r.id=g4.ResourceID and g4.GoalNumber=4)

and pairs it with every row in copy 4 with GoalNumber=4, *regardless of Subject or Grade*

where ((g0.Subject='Social_Studies')
  and (g0.Grade='4th'))
group by r.id;

and finally, we pick one of those many rows to display (the effect of the GROUP BY r.id).

Do you see why that is both more than and different from what you want? Consider a resource with the following rows in resource_goals:

+------------+-------+----------------+------------+--------------+
| ResourceID | Grade | Subject        | GoalNumber | NumericGrade |
+------------+-------+----------------+------------+--------------+
|         14 | 4th   | Social_Studies |          7 |            4 |
|         14 | 1st   | English        |          1 |            1 |
|         14 | 2nd   | English        |          2 |            2 |
|         14 | 3rd   | English        |          3 |            3 |
|         14 | 5th   | History        |          4 |            5 |
+------------+-------+----------------+------------+--------------+

ResourceID 14 would be returned by your query, but isn't what you want. I'd guess you haven't come across a case like this because you have few, if any, cross-subject resources, but I assume they are a possibility, since you have Subject part of the resource_goals table, rather than part of the resources table.

The fastest time was 2 min 48 sec.
Last time (simplest query) was 3 min 2 sec.

If we join each row in resources to a single row in each copy of resource_goals using an index, this should be reasonably fast, but I expect you are getting multiple matching rows in each copy, as it stands now. The total resulting rows per id is the product of the matches in each copy. If just 3 rows match your current conditions per copy, that would be 3^5 = 243 rows per resource id, where we expect only 1! In other words, I expect some of the slow down is due to the overhead of retrieving many times the number of desired rows. The rest is probably lack of a suitable index.

As I understand it, you are looking for a resource for 4th grade Social Studies which meets goals 1 through 4. In terms of your tables, that corresponds to having 4 rows in resources_goals, *all* of which have grade=4th and Subject='Social_Studies'. That is, we need to look in *4* copies of resources_goals (not 5). For a given resource id, we want exactly one row from each copy, namely, the row with the correct resource id, correct subject, correct grade, and desired goal number. I think this should do:

  SELECT r.id
  FROM resources as r
  JOIN resource_goals as g1
    ON  r.id = g1.ResourceID
    AND g1.Subject = 'Social_Studies'
    AND g1.Grade = '4th'
    AND g1.GoalNumber = 1
  JOIN resource_goals as g2
    ON  r.id = g2.ResourceID
    AND g2.Subject = 'Social_Studies'
    AND g2.Grade = '4th'
    AND g2.GoalNumber = 2
  JOIN resource_goals as g3
    ON  r.id = g3.ResourceID
    AND g3.Subject = 'Social_Studies'
    AND g3.Grade = '4th'
    AND g3.GoalNumber = 3
  JOIN resource_goals as g4
    ON  r.id = g4.ResourceID
    AND g4.Subject = 'Social_Studies'
    AND g4.Grade = '4th'
    AND g4.GoalNumber = 4;

That should return 1 row per resource, so long as there are no duplicates rows in resource_goals.

This should do the same thing:

  SELECT ResourceID
  FROM resource_goals
  WHERE Subject = 'Social_Studies'
    AND Grade = '4th'
    AND GoalNumber IN (1, 2, 3, 4)
  GROUP BY ResourceID
  HAVING COUNT(*) = 4;

You'll have to try both to see which is faster.

I'm really running out of time on this project, so I just went ahead and made the user interface such that users can only select one subject, grade, and/or goal number at a time. It's probably a sound decision from a usability perspective, so I'm not too sad. If I decide to make this work in the future, I'd probably have to just do a SELECT for each "AND"ed field, get the list of resource id's for each SELECT, then find the intersection of the lists in Perl.

If I could speed this up with some kind of indexing, I'd love to know about it.

You mentioned before that the combination of ResourceID, Subject, Grade, and GoalNumber will be unique. If you have not already done so, you should add a UNIQUE INDEX on that combination of columns.

  ALTER TABLE resource_goals
  ADD UNIQUE INDEX (ResourceID, Subject, Grade, GoalNumber);

This will enforce that constraint, and either query I gave above could use it to speed things up. If you wanr further comments on indexes, include the output from

  SHOW INDEXES FROM resources;
  SHOW INDEXES FROM resource_goals;

in your next message.

The GROUP BY phrase is because I wanted just one row per resource. It seemed like I'd get a row for each condition/resource (didn't test it with the final ). I'm actually doing SELECT * FROM... in my code, and not using the resource_goals information in my output (that's a separate view at present, generated by different Perl code).

When you get more rows than you expect, you should resist the urge to "fix" your query by adding DISTINCT or GROUP BY. Extra rows usually mean you've missed a condition. At best, such a query is inefficient, at worst, it gives wrong results. Either way, DISTINCT and GROUP BY are usually cover-ups, not solutions.

I'd also suggest not selecting columns you don't need. That adds overhead. Just select r.id or r.*, if that's all you need.

It seems to work fine without the NOT NULL parts, you're right. I was wondering about that, but was sleepy enough at the time that I didn't trust my thinking.

Sorry about not including my table defs :-( . I guess I was just hoping for a general approach, and didn't realize that anyone would be interested enough to read all that detail and provide an exact solution for me. Of course, now I realize that it would have simplified our discussion. Anyway, late but not never, and for help to whoever finds this in the list archives someday, here are my table defs (you're right - the goal number is a TINYINT):

(Below is an abridged version of the resources table ; it also contains about 60 more TINYINT fields which are essentially used as booleans, some of which I hope to eliminate. Yes, I could have used SET or something, but I didn't for various reasons.)
CREATE TABLE resources (
  id INT UNSIGNED PRIMARY KEY,
  Title TEXT,
  ResourceType_THJHArticle TINYINT,
  ResourceType_NIEArticle TINYINT,
  DataEntryName TINYTEXT,
  Date DATETIME,
  Notes TEXT,
  Made_Keywords TEXT);
CREATE TABLE  resource_goals (
  goal_id INT UNSIGNED PRIMARY KEY,
  ResourceID INT,
  Grade TINYTEXT,
  Subject TINYTEXT,
  GoalNumber TINYINT,
  NumericGrade TINYINT);


Some comments:

In general, tables with fixed-length rows are faster than tables with variable-length rows (with some wasted space as the tradeoff). To make a table have fixed-length rows requires every column to be fixed length. I doubt that's reasonable for table resources, but I'm guessing it's doable for table resource_goals. Even if not, it's still a good idea not to make columns any bigger than they need to be. From your example, I would expect that column Grade contains values such as '1st', '3rd', and '11th', so I would think a CHAR(4) would do. TINYTEXT is 255 characters wide. Similarly, I doubt you have any Subjects which approach 255 characters in length.

I see another column named NumericGrade. Is that simply the same info as is in column Grade, expressed as a number ('1st' = 1, '2nd' = 2, and so on)? If so, that's good news, for 2 reasons. First, numeric comparisons are faster than string comparisons, so we would make a couple of changes to my advice above. In that case, we'd use NumericGrade instead of Grade in the multi-column unique index, and we'd substitute NumericGrade = 4 for Grade='4th' in the queries.

Second, that would allow us to improve things by further normalizing your data. Having two columns with the same information is considered a bad idea, as it is wasteful of space and can impact speed. I assume the purpose of column Grade is to express the grade in a nicer form. Instead, we make a new table

  CREATE TABLE grade (id TINYINT PRIMARY KEY, name CHAR(4), INDEX (name));

  INSERT INTO grade
   SELECT DISTINCT NumericGrade, Grade FROM resource_goals
   ORDER BY NumericGrade;

which holds the mapping from number to text in just a few rows. Then we no longer need the Grade column in table resource_goals.

Similarly, we could normalize the Subjects:

  CREATE TABLE subject (id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
                        name CHAR(25), INDEX (name));

  INSERT INTO subject (name)
    SELECT Subject FROM resource_goals GROUP BY subject;

  ALTER TABLE resource_goals ADD subject_id INT UNSIGNED NOT NULL;

  UPDATE resource_goals JOIN subject
      ON resource_goals.Subject = subject.name
  SET resource_goals.subject_id = subject.id;

Now we no longer need the Subject column in resource_goals.

  ALTER TABLE resource_goals
  DROP Grade,
  DROP Subject;

We may then wish to rename column NumericGrade to just grade:

  ALTER TABLE resource_goals
  CHANGE NumericGrade grade TINYINT;

At this point, the multi-column index I mentioned above would be defined this way:

  ALTER TABLE resource_goals
  ADD UNIQUE INDEX (ResourceID, subject_id, grade, GoalNumber);

In general, we now get the strings for grade and subject by joining to the new tables on their respective IDs. In this particualr case, with about 12 grades and 20 subjects, it should be simple to select the grade and subject IDs and names from their respective tables, then use that to build your form with drop-down menus which show the names, but use the corresponding IDs in any query. Then, the queries above become (assuming 'Social_Studies' has ID = 14):

  SELECT r.id
  FROM resources as r
  JOIN resource_goals as g1
    ON  r.id = g1.ResourceID
    AND g1.subject_id = 14
    AND g1.grade = 4
    AND g1.GoalNumber = 1
  JOIN resource_goals as g2
    ON  r.id = g2.ResourceID
    AND g2.subject_id = 14
    AND g2.grade = 4
    AND g2.GoalNumber = 2
  JOIN resource_goals as g3
    ON  r.id = g3.ResourceID
    AND g3.subject_id = 14
    AND g3.grade = 4
    AND g3.GoalNumber = 3
  JOIN resource_goals as g4
    ON  r.id = g4.ResourceID
    AND g4.subject_id = 14
    AND g4.grade = 4
    AND g4.GoalNumber = 4;

or

  SELECT ResourceID
  FROM resource_goals
  WHERE subject_id = 14
    AND grade = 4
    AND GoalNumber IN (1, 2, 3, 4)
  GROUP BY ResourceID
  HAVING COUNT(*) = 4;

The latter method, by the way, allows some flexibility. For example, you could change the HAVING clause to "HAVING COUNT(*) >= 3" to find resources which meet at least 3 of the specified goals.

Thanks a bunch for your help; I'm finding this more interesting than I thought I would.


You're welcome.

Michael

--
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: select where multiple joined records match

Reply via email to