I'm using MySQL for an information retrieval application in which word
occurrences are indexed. Performance is not as good as I would expect:
query time seems to grow nearly linearly with the number of rows.
Any advice would be welcome; I'll lay out a lot of detail below.
The basic scenario is:
- there are over 498M rows. This seems to qualify it as a
rather large table
- the table has 6 fixed-length fields
- I have created indexes on each field
- I set MAX_ROWS to 550,000,000 when creating the table
- I've tried this with both MyISAM and InnoDB tables, with very
similar performance
- Slower inserts would not be a problem if I could get faster searching
- I've run ANALYZE, OPTIMIZE, myisamchk etc. (see below)
- I'm using a socket for local communication and MySQL++ for the
application, but I've also confirmed the benchmark results using the
mysql command-line client
- I'm not presenting a lot of variety in queries and response times below,
but can send more if needed.
- Due to the patterns in the data, there are many, many duplicate
values in the indexes. I suspect this might be important (a quick way
to check this is sketched just after this list).
- This is not a results transport issue. I'm able to retrieve
about 1000 rows/second, which is OK. It's generating the result set,
prior to results transport, which is slow.
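(For reference, a quick sketch of how the duplication can be checked from
within mysql; the Rec/key column in the myisamchk output at the end of
this message tells the same story:)

  -- Per-index cardinality; a cardinality that is tiny relative to 498M
  -- rows means many duplicate key values in that index.
  SHOW INDEX FROM inv0web02;

  -- Or count distinct values directly for one column (this does a full
  -- scan, so it is slow on a table this size):
  SELECT COUNT(DISTINCT termid) FROM inv0web02;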
Queries are definitely disk-bound, with CPU utilization of 5-40%
during a long-running query. (I did a little experimentation with a
ramfs to confirm this... too bad I don't have 40GB of RAM; a rough
sketch of that experiment follows.)
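(Roughly what that ramfs test looked like -- the mount point and the idea
of copying just this one table's files are illustrative; this is a sketch,
not my exact procedure:)

  # Make a RAM-backed filesystem and copy the table files into it,
  # then point a scratch data directory (or a symlink) at the copy.
  mount -t ramfs ramfs /mnt/ramtest
  cp /data/mysql/irt/inv0web02.* /mnt/ramtest/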
The application: I want to run queries that will identify particular
DOCUMENTS (via "docid") that have particular TERMS (via "termid").
This is your garden-variety task for a search engine or other
information retrieval system. I often submit additional criteria,
such as adjacent terms (offset A - offset B = 1), terms in a
particular tag (term A in <title> and term B in <title>), or proximity
(paragraph A == paragraph B); a sketch of such a query follows this
paragraph. Being able to do such powerful queries is the whole point
of using MySQL, rather than trying to manage things myself (using
BerkeleyDB or home-grown B-trees, as I've done in the past).
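(A sketch of that kind of combined query against the table shown below;
the termids and the taglistid value standing in for "inside <title>" are
made up for illustration:)

  -- Two terms in the same document and the same paragraph, both inside
  -- the same tag list, with term B immediately following term A.
  select t0.docid
  from inv0web02 as t0, inv0web02 as t1
  where t0.termid = 1234            -- term A (illustrative id)
    and t1.termid = 5678            -- term B (illustrative id)
    and t0.docid = t1.docid
    and t0.whichpara = t1.whichpara -- proximity: same paragraph
    and t0.taglistid = 42           -- tag criterion (illustrative id)
    and t1.taglistid = 42
    and (t1.offset - t0.offset) = 1 -- adjacency
  order by t0.weight_in_subdoc
  limit 1000;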
I have made a quick 2,000-document database, then a larger
100,000-document one, then the full 1.2M-document database (the input
is about 20GB of HTML).
The challenge is that on the 1.2M-document dataset with nearly half a
billion rows (term occurrences), I just can't seem to get a quick
response for combinations of terms that occur in a lot of rows.
Maybe the answer is that this is the best I can expect, due to
duplicate key values and other factors (that is, there are many terms
with each docid, many many terms with each paragraph id, many many
many terms with each offset, and so on). But I hope there might be
other tuning parameters or tips that will speed things up.
If I look for terms that appear in only a few hundred rows, no
problem. But for terms with thousands of rows, it takes a very long
time (several minutes for the first 1000 rows, up to an hour for all
rows) to get the query output, even when I use ORDER BY and LIMIT to
keep the result set small.
Concrete example:

  select *
  from inv0web02 as t0, inv0web02 as t1
  where t0.termid = 35
    and t1.termid = 141
    and t0.docid = t1.docid
    and t0.whichpara = t1.whichpara
    and (t1.offset - t0.offset) = 1
  order by t0.weight_in_subdoc
  limit 1000
This takes about 3 minutes, where termid 35 is "web" and termid 141 is
"page" (very common terms). (Timing depends slightly on whether I
employ a Result or a ResUse in my MySQL++ code, but I don't think
MySQL++ is the controlling factor here, since I've also experimented
with mysql on the command line.) This is on a Red Hat 9 box: a Dell
4600 with 12GB RAM, 2 x 2.8GHz Xeons (hyperthreaded to look like 4
CPUs) and 10K RPM drives. It's a pretty speedy system. I'm using
mysql-max-4.0.17-pc-linux-i686 (the binary build).
With less common terms, I'm able to get a response in just a few
seconds. The subtraction on the offsets is not the deciding factor;
performance is comparable without it. Clearly, the issue is the
self-join on the same table using the WHERE criteria.
The "order by" doesn't matter much; the "limit" speeds things up quite
a bit since the response set is smaller. (I use these so that the
program that uses the results only needs to look at the "best," and to
limit the response set size). If I remove the "limit 1000", the query
can take well over 1/2 hour (even without actually retrieving the
rows).
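(The two variants, roughly -- the same query as above with one piece
changed in each case:)

  -- Variant 1: drop the offset subtraction; timing is comparable.
  select * from inv0web02 as t0, inv0web02 as t1
  where t0.termid = 35 and t1.termid = 141
    and t0.docid = t1.docid and t0.whichpara = t1.whichpara
  order by t0.weight_in_subdoc limit 1000;

  -- Variant 2: drop the LIMIT; this can run well over half an hour.
  select * from inv0web02 as t0, inv0web02 as t1
  where t0.termid = 35 and t1.termid = 141
    and t0.docid = t1.docid and t0.whichpara = t1.whichpara
    and (t1.offset - t0.offset) = 1
  order by t0.weight_in_subdoc;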
I have my.cnf tuned to use a lot of memory, so we're not doing a lot
of swapping or thrashing. The total memory footprint for the mysql
process maxes out at 1.5GB or so. Most of the query wall-clock time
seems to be disk access. For example, from my.cnf:

myisam_sort_buffer_size = 640M
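(The rest of the memory-related settings are of this general shape. The
option names are real MySQL 4.0 variables, but these particular values
are illustrative rather than a copy of my my.cnf -- only the
myisam_sort_buffer_size line above is quoted exactly:)

  [mysqld]
  key_buffer_size         = 1024M   # MyISAM index block cache (illustrative value)
  sort_buffer_size        = 64M     # per-connection sort buffer (illustrative value)
  read_buffer_size        = 8M      # per-connection sequential read buffer (illustrative value)
  read_rnd_buffer_size    = 16M     # reads after a filesort (illustrative value)
  myisam_sort_buffer_size = 640M    # quoted above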
Explain looks fine, in terms of using indices. For the query above:

  table: t0
  type: ref
  possible_keys: PRIMARY,termid_index,whichpara_index,docid_index
  key: termid_index
  key_len: 4
  ref: const
  rows: 405296
  Extra: Using where; Using filesort

  table: t1
  type: ref
  possible_keys: PRIMARY,termid_index,whichpara_index,docid_index
  key: docid_index
  key_len: 4
  ref: t0.docid
  rows: 402
  Extra: Using where
A little more detail follows. Thanks for any advice or ideas.
-- Greg
mysql> show create table inv0web02;
CREATE TABLE `inv0web02` (
`docid` int(10) unsigned NOT NULL default '0',
`offset` smallint(5) unsigned NOT NULL default '0',
`termid` int(10) unsigned NOT NULL default '0',
`taglistid` smallint(5) unsigned NOT NULL default '0',
`whichpara` tinyint(3) unsigned NOT NULL default '0',
`weight_in_subdoc` float unsigned NOT NULL default '0',
PRIMARY KEY (`docid`,`offset`),
KEY `termid_index` (`termid`),
KEY `whichpara_index` (`whichpara`),
KEY `taglistid_index` (`taglistid`),
KEY `weight_index` (`weight_in_subdoc`),
KEY `docid_index` (`docid`),
KEY `offset_index` (`offset`)
) TYPE=MyISAM |
# /usr/local/mysql/bin/myisamchk -divv /data/mysql/irt/inv0web02
MyISAM file: /data/mysql/irt/inv0web02
Record format: Fixed length
Character set: latin1 (8)
File-version: 1
Creation time: 2004-01-11 23:25:42
Recover time: 2004-01-12 3:31:35
Status: checked,analyzed
Data records: 498093481 Deleted blocks: 0
Datafile parts: 498093481 Deleted data: 0
Datafile pointer (bytes): 4 Keyfile pointer (bytes): 4
Datafile length: 8965682658 Keyfile length: 25881316352
Max datafile length: 77309411326 Max keyfile length: 4398046510079
Recordlength: 18
table description:
Key Start Len Index Type Rec/key Root Blocksize
1 2 4 unique unsigned long 402 185730048 1024
6 2 unsigned short 1
2 8 4 multip. unsigned long 132 9169156096 1024
3 14 1 multip. binary 1945678 11691716608 1024
4 12 2 multip. unsigned short 7600 14722002944 1024
5 15 4 multip. float 192 18786515968 1024
6 2 4 multip. unsigned long 402 22851028992 1024
7 6 2 multip. unsigned short 28253 25881315328 1024
Field Start Length Nullpos Nullbit Type
1 1 1
2 2 4
3 6 2
4 8 4
5 12 2
6 14 1
7 15 4
Dr. Gregory B. Newby, Research Faculty, Arctic Region Supercomputing Center
University of Alaska Fairbanks. PO Box 756020, Fairbanks, AK 99775
e: newby AT arsc.edu v: 907-474-7160 f: 907-474-5494 w: www.arsc.edu/~newby