Re: Lots of FULLTEXT stuff (suggestions)

Thomas Spahni Tue, 26 Aug 2003 11:08:21 +0000

Matt,

I fully agree that indexing short words and numbers is a necessity
sometimes. I'm processing legal text where abbreviations are widely used
and people want to search for chunks like:
      Art. 234 Abs. 3 OR
and the search should also find occurrences of
      Art. 234 OR


These are so common that I risk running into the 50% cutoff unless I use
BOOLEAN MODE. Indexing numbers is top of my wish list.
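
To illustrate the 50% cutoff concern, here is a minimal Python sketch (not
MySQL code) that reports which query terms occur in half or more of the
rows. The function names, the sample rows and the "at least 50%" reading of
the threshold are assumptions made for this example only.

```python
import re


def words(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def common_terms(rows, query, threshold=0.5):
    """Return the query terms found in at least `threshold` of the rows.

    Assumption: a term counts once per row, however often it repeats there.
    """
    row_sets = [set(words(row)) for row in rows]
    result = []
    # dict.fromkeys removes duplicate query terms but keeps their order
    for term in dict.fromkeys(words(query)):
        matching = sum(term in row_set for row_set in row_sets)
        if row_sets and matching / len(row_sets) >= threshold:
            result.append(term)
    return result


# Hypothetical sample rows in the style of the legal references above
rows = ["Art. 234 OR", "Art. 12 OR", "Art. 234 Abs. 3 OR", "ZGB Art. 1"]
print(common_terms(rows, "Art. 234 OR"))
```

With these sample rows every term of the query would be flagged as too
common, which is the situation described above.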

I have observed users' queries for some time now and found that the ranking
of the results is optimal (i.e. matches the users' expectations) when the
words they typed occur close together in the text (but not necessarily close
to the top).
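
The observation about proximity can be sketched in Python as follows. This
is only an illustration, not how MySQL ranks results: it scores each text by
the shortest window of tokens that contains every query term and lists the
tightest matches first. The names min_span and rank_by_proximity are
hypothetical.

```python
import re
from itertools import product


def tokens(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"\w+", text.lower())


def min_span(text, query):
    """Length in tokens of the shortest window holding all query terms.

    Returns None if any query term is missing from the text.
    """
    doc = tokens(text)
    positions = [[i for i, tok in enumerate(doc) if tok == term]
                 for term in tokens(query)]
    if not positions or any(not p for p in positions):
        return None
    # Brute force over one position per term; fine for short queries
    return min(max(c) - min(c) + 1 for c in product(*positions))


def rank_by_proximity(docs, query):
    """Return the matching docs, smallest window first."""
    scored = [(min_span(doc, query), doc) for doc in docs]
    hits = [(span, doc) for span, doc in scored if span is not None]
    return [doc for span, doc in sorted(hits, key=lambda pair: pair[0])]


docs = [
    "Art. 12 was amended in 234 cases, OR so it seems",
    "see Art. 234 Abs. 3 OR for details",
    "nothing relevant",
]
print(rank_by_proximity(docs, "Art. 234 OR"))
```

The position of the window inside the text does not enter the score, in
line with "not necessarily close to the top".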

Regards,
Thomas Spahni

On Sun, 24 Aug 2003, Matt W wrote:

> Hi all,
>
> I'm planning to use MySQL's full-text search for my forum system
> (possibly 5+ million posts). I've been playing with it a lot lately to
> see the performance and functionality and have some
> suggestions/questions.
>
> First, since a few of you may be wanting to know, here is a thread where
> I was doing some speed/optimization tests and stuff with 3 million
> posts: http://www.sitepointforums.com/showthread.php?threadid=69555
> (From post #12)
>
> In particular, I discovered that IN BOOLEAN MODE is really slow if you want to
> sort by relevance (with a lot of matching rows anyway). :-( For
> non-BOOLEAN searches, though, I can get 1000 relevance-sorted results in
> about 8-10 secs. for searches that match a LOT of rows and everything
> has to be read from disk. The full-text processing seems to be very fast
> (max 1-2 seconds of "FULLTEXT initialization" in PROCESSLIST). It's the
> disk seeks to read random rows from the data file ("Sending data") that
> take the most time (7200 RPM/~8ms seek IDE drive). Searches are *MUCH*
> faster when the needed parts of the data file are cached by the OS!
>
> Anyway, my suggestions:
>
> --------------------------------------------------
> *) Min/Max Word Length -- This should really be able to be set on at
> least a per table basis (others may want per index). Right now, people
> that don't have control of the server are at the mercy of the admin to
> change the min/max word length.
>
> I would also suggest that ft_min_word_len be 3 and ft_max_word_len be 32
> by default. I think these would be better defaults for everyone than the
> current 4/254.
>
> Or if we could use
>
> SET ft_min_word_len=n;
>
> etc. for the current connection it would be nice.
>
>
> *) Parser: Indexing of Any and All Numbers -- I think it would be a good
> idea to index any sequence of digits less than ft_min_word_len long.
> Anything numeric could be very relevant for searching -- software
> versions, ages, dates, etc. -- and shouldn't be excluded.
>
> Even anything *containing* a number (among letters) is probably relevant
> for searching, again, even if it's shorter than ft_min_word_len. e.g.
> RC1, B2, 8oz, F5, etc.
>
>
> *) Parser: Other Things -- I've seen people trying to search
> catalog/item/part numbers with "pieces" of the "number" separated by -
> or / for example (making some "pieces" too short). How about indexing
> words that are on either side of a "-" or "/" (with no space) no matter
> their length? I don't mean including the - or / in the index -- just the
> usual word characters on either side (I think) as *separate* words, not
> a *single* word with the - or / removed. This would help with things
> like CD-ROM, TCP/IP, etc.
>
> Single quotes being counted as a word character is another issue I have.
> (I discovered that they're not counted as part of the word when on the
> end(s): 'quote' (thank God! :-))) Example: if someone searches for
> MySQL, it won't find rows with MySQL's. Since possessive's (sic) are the
> biggest problem, how about stripping any 's from the end of the word in
> the index? So MySQL's would be indexed as MySQL.
>
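
The parser suggestions quoted above (numbers, "-" and "/" pieces, trailing
's) can be sketched as a small Python tokenizer. This is a hypothetical
illustration of the proposed rules, not MySQL's parser; index_terms and
MIN_WORD_LEN are names made up for the example, and 4 is the default minimum
word length mentioned in the quoted message.

```python
import re

MIN_WORD_LEN = 4  # stands in for ft_min_word_len (current default per the thread)


def index_terms(text, min_len=MIN_WORD_LEN):
    """Return the terms that would be indexed under the proposed rules."""
    terms = []
    # A chunk is a run of word characters, single quotes, '-' and '/'
    for chunk in re.findall(r"[\w'/-]+", text):
        pieces = [p for p in re.split(r"[-/]", chunk) if p]
        joined = len(pieces) > 1  # e.g. CD-ROM or TCP/IP, no spaces
        for piece in pieces:
            word = piece.strip("'")  # quotes on the ends are not part of the word
            if word.lower().endswith("'s"):
                word = word[:-2]  # index MySQL's as MySQL
            if not word:
                continue
            has_digit = any(ch.isdigit() for ch in word)
            # Short words are kept if they contain a digit or were joined by - or /
            if joined or has_digit or len(word) >= min_len:
                terms.append(word.lower())
    return terms


print(index_terms("MySQL's CD-ROM and TCP/IP on RC1 with 8oz 'quote' v 42"))
```

Plain short words such as "and" or "on" are still dropped here; only the
cases named in the suggestions bypass the minimum length.
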
>
> *) "Always Index" Words -- Like it says in the full-text TODO section of
> the manual. This should be able to be set on at least a per table basis
> (again, others may want per index).
>
>
> *) Stopword File -- I would also like to be able to define this per
> table somehow.
>
>
> *) Miscellaneous -- Mostly functionality related, from the TODO:
> STEMMING! (controlled more finely than server level I hope), multi-byte
> character set support, proximity operators. Anything to get it closer to
> Verity's full-text functionality. ;-)
>
> Any speed/optimization improvements are welcome for gigs of data,
> especially with IN BOOLEAN MODE (e.g. automagically sorted by relevance
> like a natural language query, although this is probably difficult if a
> wildcard* is used?). And the FULLTEXT index shouldn't always be chosen
> for non-const join types when another index would find fewer rows first.
> e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
> PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible, since
> I guess it's a problem auto sorting by relevance if it's not using the
> FULLTEXT index.
> --------------------------------------------------
>
> To other full-text users: what do you think of these suggestions?
>
> To the developers: any word on if and when any of these things would be
> implemented? I know from the TODO and other list messages that some
> will. Any *estimates* (in months or MySQL version) on when would be
> great. Just any info on new full-text features, even ones that I didn't
> mention, would be awesome to hear. :-) And like how they would be
> implemented and used by us (syntax, etc.).
>
> How about changing the default min/max (or just min if you want) word
> length? I think everyone *really* wishes ft_min_word_len was 3. Seems
> like that and indexing numbers shorter than min_word_len could be easily
> done. Please? :-)
>
> Here's a couple mailing list threads about full-text:
> http://lists.mysql.com/list.php?3:sss:2365
> http://lists.mysql.com/list.php?3:sss:6749
>
> There Sergei is talking about a new .frm format (plain text) that will
> allow more of these features. Will it allow us to somehow define how to
> parse things or something?? Could you elaborate more on what this will
> bring? In November 2001, he said the new .frm format would be here "this
> year." It's been almost 2 years since then, so when is it due? ;-/ Talk
> of a "dynamic" stopword list sounds interesting.
>
> Also, are the current MySQL versions using the "2 level" full-text index
> format yet? I'm thinking not?
>
> Finally, in the full-text TODO, it says "Generic user-suppliable UDF
> preparser." Could you also elaborate on this? The "generic" part almost
> makes it sound like some sort of "script" to define how to parse the
> text. But UDF makes it sound like a separate thing that has to be loaded
> with CREATE FUNCTION. But UDFs won't work with your MySQL binaries, will
> they, since they're compiled statically?
>
> Looking forward to any comments from the developers and other users.
> Thanks in advance!
>
> Matt
>
>
>


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]
