Re: Lots of FULLTEXT stuff (suggestions)

Sergei Golubchik Fri, 12 Sep 2003 11:48:11 -0700

Hi!

On Sep 11, Matt W wrote:
> Hi Sergei!
> 
> I'll try to keep my observations/ideas below as short and simple to
> understand as possible. :-)


Thanks :)

Here I reply very quickly to some questions.
Another reply will follow...
 
> Sure, boolean mode is faster in *some* cases, since, as you said, it
> doesn't need "the list of all matched documents." From my experience,
> it's [only] faster in searches like ' some words ' (no boolean
> operators; and yes, I know "some" is a stopword ;-)), especially with
> LIMIT, but it just returns the first documents it finds with any single
> word.

No, it's MUCH faster when the search string contains popular words.  E.g.
1,000,000 rows, and a few (say, five) words that are each present in
400,000 rows, but whose intersection is small (numbers similar to my
benchmarks).  This forces the NL fulltext search engine to keep a list of
1,000,000 documents (w/o actual row data, of course) in memory and
search through it, which will be slow as hell.
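
To make the difference concrete, here is a toy model in Python. It is only
a sketch and not the MySQL code: the posting lists and both function names
are made up for illustration, and the real boolean search is more involved.
The natural-language side has to keep one entry per document that matches
ANY word, while the all-words side walks the sorted lists in step and keeps
only one cursor per word.

```python
from collections import Counter


def nl_candidates(postings):
    """Toy natural-language search: one accumulator entry per document
    that contains ANY of the query words (hypothetical helper)."""
    hits = Counter()
    for doc_ids in postings.values():
        hits.update(doc_ids)
    return hits


def boolean_all(postings):
    """Toy '+w1 +w2 ...' search: yield the ids found in EVERY list.

    Assumes each posting list is sorted ascending with no duplicates.
    Only one cursor per word is held, never the full candidate list.
    """
    iters = [iter(ids) for ids in postings.values()]
    if not iters:
        return
    current = [next(it, None) for it in iters]
    while None not in current:
        top = max(current)
        if all(c == top for c in current):
            yield top
            current = [next(it, None) for it in iters]
        else:
            # Advance every list whose cursor is behind the largest id.
            current = [c if c == top else next(it, None)
                       for c, it in zip(current, iters)]


# Tiny made-up index: word -> sorted document ids.
postings = {
    "a": [1, 2, 3, 5, 8],
    "b": [2, 3, 4, 8, 9],
    "c": [0, 2, 8, 10],
}
print(len(nl_candidates(postings)), list(boolean_all(postings)))
```

Because boolean_all is a generator, a query with LIMIT can stop after the
first few matches instead of collecting everything first.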

> ...
> the ones that contain more of the words
> will usually be ranked highest from my experience.

Normally yes.
But this is not guaranteed. E.g. a document that contains only one of
the words, but is short, can have higher relevance than another one
that contains both words but also has many other words, so that the
relative weight of these two words in the document is low. This effect is
only noticeable when the texts' lengths are significantly different.
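
A small Python illustration of this length effect. The scoring function
below is a deliberately simplified stand-in (matches divided by document
length), not the formula MySQL uses, and the function name and sample
texts are invented for the example.

```python
def toy_relevance(query_words, document):
    """Hypothetical score: query-word occurrences / total words in the text."""
    words = document.lower().split()
    if not words:
        return 0.0
    matches = sum(words.count(q) for q in query_words)
    return matches / len(words)


query = ["mysql", "fulltext"]
short_doc = "mysql rocks"                        # 2 words, one query word
long_doc = "mysql fulltext " + "filler " * 48    # 50 words, both query words

short_score = toy_relevance(query, short_doc)
long_score = toy_relevance(query, long_doc)
print(short_score > long_score)
```

In this toy model the short document with a single matching word scores
higher than the long one that contains both.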

> It would be great if these disk seeks could be optimized to read a chunk
> of rows at a time *in row order* as it seems right now that each row is
> read one-at-a-time in relevance order. Like if you could take a chunk
> of, say, 1000 row pointers which are in relevance order, sort them in
> datafile row order, and then read them like that. Wouldn't this cause
> fewer random seeks since you keep moving "in the same direction" in the
> file?

Yes, that would.
I can say it for sure, as this optimization is already implemented in
MySQL. It is used in filesort. I should've known it can be used here too
:)
Thanks.
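
The idea can be sketched in a few lines of Python. This is only a model of
the suggestion, not the filesort code: the offsets are invented, the
function names are hypothetical, and "seek cost" is simply the distance
the file position travels.

```python
def chunked_read_plan(ranked_offsets, chunk_size=1000):
    """Take row offsets in relevance order and return the order in which
    to read them: each chunk of `chunk_size` rows is sorted by file offset."""
    plan = []
    for start in range(0, len(ranked_offsets), chunk_size):
        chunk = ranked_offsets[start:start + chunk_size]
        plan.extend(sorted(chunk))
    return plan


def seek_distance(offsets, start=0):
    """Total distance the file position moves when reading in this order."""
    total = 0
    pos = start
    for off in offsets:
        total += abs(off - pos)
        pos = off
    return total


# Made-up datafile offsets, listed in relevance order.
ranked = [900, 100, 500, 300, 700]
plan = chunked_read_plan(ranked, chunk_size=5)
print(seek_distance(ranked), seek_distance(plan))
```

The rows of a chunk still have to be handed back in relevance order after
they are read, so the caller needs to remember each row's rank.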

> Right now if I want all words in my application, I'm favoring using a
> natural language query with LIMIT 1000 or so and then running another
> query with LIKE to check those 1000 document IDs to see if they contain
> all words.

You can simply do AND MATCH ... AGAINST ("..." IN BOOLEAN MODE)+0

+0 guarantees that no index will be used for this boolean MATCH,
so the index will be used for your MATCH in natural-language mode instead.
(I cannot say off the top of my head which of the two MATCHes the
optimizer will prefer, so this simple trick will do.)
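
For example, an application could build such a query like this. It is just
a sketch: the table "articles", the columns "id", "title" and "body", and
the helper name are all assumptions, and the %s placeholders follow the
"format" parameter style of Python drivers such as MySQLdb.

```python
def build_all_words_query(words, limit=1000):
    """Return (sql, params) for a relevance-ranked natural-language search
    that keeps only rows containing ALL of the given words.

    Table and column names are hypothetical; adjust them to your schema.
    """
    nl_text = " ".join(words)                    # e.g. 'mysql fulltext'
    bool_text = " ".join("+" + w for w in words)  # e.g. '+mysql +fulltext'
    sql = (
        "SELECT id FROM articles "
        "WHERE MATCH(title, body) AGAINST(%s) "
        # The +0 turns the boolean MATCH into a plain expression,
        # so it acts as a filter only.
        "AND MATCH(title, body) AGAINST(%s IN BOOLEAN MODE)+0 "
        f"LIMIT {int(limit)}"
    )
    return sql, (nl_text, bool_text)


sql, params = build_all_words_query(["mysql", "fulltext"])
print(sql)
print(params)
```

The pair can then be passed to the driver's cursor.execute(sql, params).
This replaces the second LIKE query with a single statement.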

> I was even thinking that IN BOOLEAN MODE syntax should be removed
> (ignored, actually, to maintain backwards compatibility) and whether or
> not boolean operators are used in the query determines whether or not
> boolean mode is used (relevance sorting in both cases). This seems
> logical. :-)

Yes, that's how it worked in the very first version.
Then I decided to make it explicit, to avoid problems with boolean
operators that can occasionally be present in a natural language query
(especially when it is a big chunk of text, which is a very typical
application for NL search).
 
> > > To the developers: any word on if and when any of these things would be
> > > implemented? I know from the TODO and other list messages that some
> > > will. Any *estimates* (in months or MySQL version) on when would be
> > > great. Just any info on new full-text features, even ones that I didn't
> > > mention, would be awesome to hear. :-) And like how they would be
> > > implemented and used by us (syntax, etc.).
> >
> > As I told - it's very difficult to predict this :(
> > Anyway, I doubt anything that requires changing .frm
> > file structure will get into 4.1
> >
> > > How about changing the default min/max (or just min if you want) word
> > > length? I think everyone *really* wishes ft_min_word_len was 3.  Seems
> > > like that and indexing numbers shorter than min_word_len could be easily
> > > done. Please? :-)
> >
> > Yes, it's safe enough for 4.1
> 
> Sorry, I don't know what this means. :-) You mean ft_min_word_len will
> be 3 by default in 4.1? And what is "safe enough?"

"safe enough" means that no big features can be added to 4.1 anymore.
Only small local changes.
Most fulltext-related changes are local :)
 
> P.S. Is there a document somewhere that has information about the
> internals of full-text search or MySQL in general? I noticed this "bk
> commit - mysqldoc tree (1.790)" message on the Internals list the other
> day:
> http://lists.mysql.com/list.php?3:mss:9961:200309:bpfbpgphemknogaidjep
> and bk commit - mysqldoc tree (1.799)
> http://lists.mysql.com/list.php?3:mss:9996:200309:lejkmpinlmdninacgcpl
> They appear to be updating a document about internals (inc. full-text
> search). However, I can't find this document anywhere (source (for
> Win32), MySQL site/docs). Could you tell me where to get it? :-) Thanks!

It's in the bk tree only.
But you can use the HTTP bk interface:

http://mysql.bkbits.net:8080/mysqldoc
http://mysql.bkbits.net:8080/mysqldoc/anno/Docs/[EMAIL PROTECTED]|src/|src/Docs 

And there's not that much about fulltext, unfortunately :(

Regards,
Sergei

-- 
   __  ___     ___ ____  __
  /  |/  /_ __/ __/ __ \/ /   Sergei Golubchik <[EMAIL PROTECTED]>
 / /|_/ / // /\ \/ /_/ / /__  MySQL AB, Senior Software Developer
/_/  /_/\_, /___/\___\_\___/  Osnabrueck, Germany
       <___/  www.mysql.com
