On Tuesday 27 Nov 2001 20:17, Mike Wexler wrote:
> > The query
> >
> > SELECT * FROM table
> > WHERE MATCH (description) AGAINST ("+A +really +nice +toy" IN BOOLEAN
> > MODE) ORDER BY MATCH (description) AGAINST ("+A +really +nice +toy" IN
> > BOOLEAN MODE) LIMIT 10
> >
> > will abort the search when it retrieves the 500th document with all the
> > three words. It's impossible to say, how many index entries it will look
> > at.
I take it the "IN BOOLEAN MODE" part of the AGAINST() is going to be new to
4.0.1.
Incidentally, how are the WHERE clauses handled when MATCH/AGAINST is used
for FTS? Given that I am seeing a fairly linear increase in query time with
the increase in number of matched terms, I would guess that the FTS is
performed first. Especially since limiting other constraints in the WHERE
clause produces no noticeable reduction in query time. This seems to be
wasteful.
Considering that FTS is likely the slowest part of the query, it would
probably be beneficial in terms of performance to have it execute last, with
all other "simpler" constraints being satisfied first, so fewer records need
to be searched.
Another question - is there a way to acquire a list of words in the FTS
index? Someting like
SELECT Word,
count(*) AS Frequency
FROM FTSIndex
GROUP BY Word
ORDER BY FREQUENCY ASC
LIMIT 100;
This would allow for easier overview of what "dead" words are being indexed,
and therefore allow for easier isolation of new "stop words", and reduction
in unnecessary searching that FTS would have to perform, thus increasing
performance. Considering that I'm really after SELECT speed, would more
careful tuning of stop words be likeky to yield signifficant performance
improvements?
It would also be REALLY nice to have a "dynamic" list of stop words. I know
you said that this is definitely planned, but it would be nice to know how
soon...
Another thing - it would probably be useful to gather some statistics about
FTS queries performed. For example, say the terms 'perl' or 'linux' appear
10000 times in the queries (since the last checkpoint interval, or whatever).
But the words 'nt' and 'vb' appear 0 times. It might be beneficial to tune
the FTS in a slightly dynamic way, so that the frequently searched terms have
a sort of "keep-alive", while the words that are never searched for (but are
indexed) get added to the stop-word list.
OTOH, if a stop word does end up being searched for a number of times that
exceeds some threshold (say, 0.1%), it could be removed from the stop word
list. The FTS index would then need to be updated, but if the database
structure is dynamic (for example a search engine of some sort) where the
records get deleted and inserted all the time, the database would be able to
"learn" what are "start words" and what are "stop words".
These statistics could be analyzed and acted upon when, say, ANALYZE is
issued. It would make the FTS "self-tuning". Of course, the FTS feature in
the TODO that allows specifiying explicitly strings that get indexed "as they
are" becomes much more useful, then, as it can prevent things from being
automatically added to the "stop word" list during ANALYZE/OPTIMIZE stage.
Has any of this been at least thought about? I've just checked the TODO, and
it doesn't appear to be there...
Looking forward to 4.0.1.
BTW, will the file formats be compatible? Or will it require a dump + restore
of the database, when going from 4.0.0 to 4.0.1?
Regards.
Gordan
---------------------------------------------------------------------
Before posting, please check:
http://www.mysql.com/manual.php (the manual)
http://lists.mysql.com/ (the list archive)
To request this thread, e-mail <[EMAIL PROTECTED]>
To unsubscribe, e-mail <[EMAIL PROTECTED]>
Trouble unsubscribing? Try: http://lists.mysql.com/php/unsubscribe.php