I am still stuck with my full text search engine. I have experemented
with
different approaches to seleting search results and figure out that
having temporary tables is about 300 times faster than doing 'inner
joins'


The table:

+-------+------------------+------+-----+---------+-------+
| Field | Type             | Null | Key | Default | Extra |
+-------+------------------+------+-----+---------+-------+
| l_id  | int(10) unsigned |      | PRI | 0       |       |
| w_id  | int(10) unsigned |      | PRI | 0       |       |
+-------+------------------+------+-----+---------+-------+

That the only used table in the query. It contains
law id and word id. For each law there are many words.
There is a vocabulary which hold the word, but that does not
matter, since only law_words table is used.

The query is:

SELECT DISTINCT  w0.l_id FROM   law_words as w0
inner join law_words as w1 on w0.l_id=w1.l_id
inner join law_words as w2 on w0.l_id=w2.l_id
inner join law_words as w3 on w0.l_id=w3.l_id
WHERE
w0.w_id  IN (258,282,287, 615, 1101, 1949, 1968, 3417, 3574, 3578,
3643,
4345,4768, 5297, 5976,6133, 7243,7245, 9271, 9348, 11146, 11150,
11172,
11232,11847, 12542, 12859, 14811, 24839, 26653,27662)
AND
w1.w_id IN (405, 2017,2192, 2592, 2595, 2603, 2981, 4055, 4068,
4346,5755,
6480, 9384, 9408, 11513, 11514, 12126, 12134, 12638, 13052, 13643,
13769,
13836, 13945, 14154, 14693, 14867, 14980, 15518, 15557, 17830, 19005,
19051,
19247, 20176, 20926, 22364, 22365, 22366, 22732, 24668, 24793, 24956,
25286,
26242, 26665, 26847, 27144, 27348, 27815, 28494, 30910, 31878, 32161,
33586,  34396)
AND
w2.w_id  IN (258,282,287, 615, 1101, 1949, 1968, 3417, 3574, 3578,
3643,
4345,4768, 5297, 5976,6133, 7243,7245, 9271, 9348, 11146, 11150,
11172,
11232,11847, 12542, 12859, 14811, 24839, 26653,27662)
AND
w3.w_id  IN (258,282,287, 615, 1101, 1949, 1968, 3417, 3574, 3578,
3643,
4345,4768, 5297, 5976,6133, 7243,7245, 9271, 9348, 11146, 11150,
11172,
11232,11847, 12542, 12859, 14811, 24839, 26653,27662)
;

It selects all laws which have all 4 words which user entered in the
search form. For example, if user entered 'look', then get ids from
the vocabulary of words: look, looks, looked, looky, etc, so we
get a list of ids for the first word. The same goes for all other
words. Then, since the search is done using AND (all words) we
need to join the table with itself, so only the laws which have
ALL the words are left at the end.

The above query takes about 30 seconds to complete.

Explain shows:
+-------+-------+---------------+---------+---------+------+------+---
---------------------------------------+
| table | type  | possible_keys | key     | key_len | ref  | rows |
Extra                                    |
+-------+-------+---------------+---------+---------+------+------+---
---------------------------------------+
| w0    | range | PRIMARY       | PRIMARY |       4 | NULL |  474 |
where used; Using index; Using temporary |
| w2    | range | PRIMARY       | PRIMARY |       4 | NULL |  474 |
where used; Using index; Distinct        |
| w3    | range | PRIMARY       | PRIMARY |       4 | NULL |  474 |
where used; Using index; Distinct        |
| w1    | range | PRIMARY       | PRIMARY |       4 | NULL |  477 |
where used; Using index; Distinct        |
+-------+-------+---------------+---------+---------+------+------+---
---------------------------------------+
4 rows in set (0.02 sec)

Show index:
+-----------+------------+----------+--------------+-------------+----
-------+-------------+----------+--------+---------+
| Table     | Non_unique | Key_name | Seq_in_index | Column_name |
Collation | Cardinality | Sub_part | Packed | Comment |
+-----------+------------+----------+--------------+-------------+----
-------+-------------+----------+--------+---------+
| law_words |          0 | PRIMARY  |            1 | w_id        | A
|        NULL |     NULL | NULL   |         |
| law_words |          0 | PRIMARY  |            2 | l_id        | A
|      228208 |     NULL | NULL   |         |
+-----------+------------+----------+--------------+-------------+----
-------+-------------+----------+--------+---------+
2 rows in set (0.00 sec)


There are 228208 records in the table.

If more words specified in a search query
search takes longer than TCP/IP timeout,
so user never gets the result.

I don't understand what's wrong with the query and why it takes
so long, so today i tried a different approach.

I have created 4 temporary tables:
create temporary table t1 (id int unsigned not null, primary key
(id));
create temporary table t2 (id int unsigned not null, primary key
(id));
create temporary table t3 (id int unsigned not null, primary key
(id));
create temporary table t4 (id int unsigned not null, primary key
(id));

Done 4 separate queries:

INSERT INTO t1 SELECT DISTINCT  w0.l_id FROM   law_words as w0
WHERE
w0.w_id  IN (258,282,287, 615, 1101, 1949, 1968, 3417, 3574, 3578,
3643,
4345,4768, 5297, 5976,6133, 7243,7245, 9271, 9348, 11146, 11150,
11172,
11232,11847, 12542, 12859, 14811, 24839, 26653,27662);

INSERT INTO t2 SELECT DISTINCT  w1.l_id FROM   law_words as w1
WHERE
w1.w_id IN (405, 2017,2192, 2592, 2595, 2603, 2981, 4055, 4068,
4346,5755,
6480, 9384, 9408, 11513, 11514, 12126, 12134, 12638, 13052, 13643,
13769,
13836, 13945, 14154, 14693, 14867, 14980, 15518, 15557, 17830, 19005,
19051,
19247, 20176, 20926, 22364, 22365, 22366, 22732, 24668, 24793, 24956,
25286,
26242, 26665, 26847, 27144, 27348, 27815, 28494, 30910, 31878, 32161,
33586,  34396);

INSERT INTO t3 SELECT DISTINCT  w3.l_id FROM   law_words as w3
WHERE
w3.w_id  IN (258,282,287, 615, 1101, 1949, 1968, 3417, 3574, 3578,
3643,
4345,4768, 5297, 5976,6133, 7243,7245, 9271, 9348, 11146, 11150,
11172,
11232,11847, 12542, 12859, 14811, 24839, 26653,27662);

INSERT INTO t4 SELECT DISTINCT  w4.l_id FROM   law_words as w4
WHERE
w4.w_id IN (405, 2017,2192, 2592, 2595, 2603, 2981, 4055, 4068,
4346,5755,
6480, 9384, 9408, 11513, 11514, 12126, 12134, 12638, 13052, 13643,
13769,
13836, 13945, 14154, 14693, 14867, 14980, 15518, 15557, 17830, 19005,
19051,
19247, 20176, 20926, 22364, 22365, 22366, 22732, 24668, 24793, 24956,
25286,
26242, 26665, 26847, 27144, 27348, 27815, 28494, 30910, 31878, 32161,
33586,  34396);

And then done:

select t1.id from t1
inner join t2 on t1.id=t2.id
inner join t3 on t3.id=t1.id
inner join t4 on t4.id=t1.id;

It returned the same result as the full query, but! it took only
0.1 second! Each 'insert into t4' takes 0.02, creation takes 0.01
and the last select with join take from 0 to 0.01 seconds.

Now, why on earth this would be 300 times faster than the above
query? It seems as mysql just can't optimize the query
properly.

Also, maybe someone got a link to articles describe effective
technique for building full-text search engine?

Regards, Artem



---------------------------------------------------------------------
Before posting, please check:
   http://www.mysql.com/manual.php   (the manual)
   http://lists.mysql.com/           (the list archive)

To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php

Reply via email to