Re: Lots of FULLTEXT stuff (suggestions)

Matt W Thu, 11 Sep 2003 16:22:46 -0700

Hi Sergei!

Thanks for your reply and taking time to read and consider my
suggestions. :-)  I didn't reply sooner because I was deciding what to
say in this message. ;-)


I joined the list specifically for posting these suggestions, and, with
your reply, I wanted to say that it's great to have direct contact with
the developers like this!

I'll try to keep my observations/ideas below as short and simple to
understand as possible. :-)


----- Original Message -----
From: "Sergei Golubchik" <[EMAIL PROTECTED]>
Sent: Wednesday, August 27, 2003 3:31 PM
Subject: Re: Lots of FULLTEXT stuff (suggestions)

> Hi!
>
> First: thanks for ideas - I'm adding them to my todo :)
>
> About dates - it's very difficult to say when a particular feature
will
> be implemented. Anyway, first I'm going to finish with this 2-level
> index structure - to implement optimizations that rely on it.
>
> > Any speed/optimization improvements are welcome for gigs of data,
> > especially with IN BOOLEAN MODE (e.g. automagically sorted by
relevance
> > like a natural language query, although this is probably difficult
if a
> > wildcard* is used?).
>
> It's not possible - at least I don't know to do it.
> In natural language mode the fulltext search is done in in "Fulltext
> initialization" stage - as you noticed. So an engine can sort
documents
> on relevance. In boolean mode each found document is returned at
once -
> that's why this search mode is faster, it need not support/keep the
list
> of all matched documents.

Yeah, it would be really nice though if boolean searches could auto-sort
by relevance (see below).

I have 2 speed vs. usefulness issues -- 1 with boolean mode and another
with natural language mode:

Sure, boolean mode is faster in *some* cases, since, as you said, it
doesn't need "the list of all matched documents." From my experience,
it's [only] faster in searches like ' some words ' (no boolean
operators; and yes, I know "some" is a stopword ;-)), especially with
LIMIT, but it just returns the first documents it finds with any single
word. I think this is pretty useless as you will get many rows that
would be low relevance. If you use boolean mode without boolean
operators, you should've just used natural language and, if combined
with LIMIT, you would fairly quickly get the most relevant documents
with "any" of the words -- and the ones that contain more of the words
will usually be ranked highest from my experience. The results are MUCH
better than boolean mode without boolean operators, even though it may
take slightly longer.

And now the issue with natural language mode: If the needed parts of the
datafile aren't cached by the OS, all the random disk seeks are KILLING
performance! However, once the query is run once and the OS has cached
the datafile parts, the search is VERY quick (with LIMIT to prevent a
ton of rows). I would say faster than any but the simplest ' no boolean
operator ' boolean searches. And the results are relevance sorted!

It would be great if these disk seeks could be optimized to read a chunk
of rows at a time *in row order* as it seems right now that each row is
read one-at-a-time in relevance order. Like if you could take a chunk
of, say, 1000 row pointers which are in relevance order, sort them in
datafile row order, and then read them like that. Wouldn't this cause
fewer random seeks since you keep moving "in the same direction" in the
file? Of course you'd need to get that chunk of rows back in relevance
order if they were read in row order. :-) If you can't hold those rows
in memory to put back in order, maybe you could read/discard,
read/discard, and so on, the rows in row order, then when you read them
in random relevance order, the data would at least be cached by the OS.
Just something I, who doesn't know much about file access, was thinking
about! :-) Boolean mode seems to read the datafile faster because, at
least in my fresh table, the results are found/read in datafile order.

Back to boolean mode. I don't think it's faster than natural language
(especially on subsequent same queries) when you have a search like '
+some +words ' or ' "some words" '. When "some" and "words" appear in
many documents, but rarely, if ever, appear in the same document. It
uses lots of CPU time finding one word in the index and failing on the
boolean criteria.

Right now if I want all words in my application, I'm favoring using a
natural language query with LIMIT 1000 or so and then running another
query with LIKE to check those 1000 document IDs to see if they contain
all words. And the documents that do will almost certainly be in the
first rows returned. e.g. once they're not for more than a few documents
in a row, they probably won't all appear in a document again in the
lower relevance results.

Heck, even if I want to simulate (' some +words ' IN BOOLEAN MODE) with
natural language, I can use (' some words words '). By specifying
"words" multiple times, it gives it higher relevance, so I can again
check with LIKE in another query that the first matches do contain
"words".

If boolean mode could auto-sort by relevance, it should first find rows
are most likely to match ' +some +words ' or ' "some words" ' or ' some
+words '. Then if it could abort if it hasn't found the *required* words
in *any* of the last 1000 documents (which would be in decreasing
relevance), it should reduce the "matches" that ultimately fail
currently in boolean mode.

Also, I've been thinking that, to a novice user, it would be logical
that boolean mode *should* sort results by relevance if that's what they
get with natural language mode. After all, to me, boolean mode simply
implies that you can use the boolean operators to require/exclude words,
etc. It certainly doesn't imply that the results, while matching the
boolean criteria, can be radically different relevance-wise.

I was even thinking that IN BOOLEAN MODE syntax should be removed
(ignored, actually, to maintain backwards compatibility) and whether or
not boolean operators are used in the query determines whether or not
boolean mode is used (relevance sorting in both cases). This seems
logical. :-)

Or maybe instead, using IN BOOLEAN MODE *syntax*, the results wouldn't
auto-sort by relevance (just like now), but it would also be possible to
use boolean operators *without* IN BOOLEAN MODE specified and the
results would sort by relevance like a natural language query.

About determining relevance if a ' wildcard* ' is used... couldn't you
use the average relevance or such of all words that start with the
prefix "wildcard"? Seems like that's as accurate as the relevance could
be and the best you could do. :-)


> > And the FULLTEXT index shouldn't always be chosen
> > for non-const join types when another index would find less rows
first.
> > e.g. ... WHERE MATCH ... AND primary_key IN (1, 2); should use the
> > PRIMARY key, not the FULLTEXT. :-) But maybe that's not possible,
since
> > I guess it's a problem auto sorting by relevance if it's not using
the
> > FULLTEXT index.
>
> Hmm. The logic in making FULLTEXT index always the preferred one is
that
> even if it's not the index as reported by EXPLAIN, it is still used
> in "Fulltext initialization". So, using it in join to retrieve rows
> adds no extra costs.
>
> But now I think that there is still the cost of reading row data from
> disk, so using PRIMARY/UNIQUE index can be faster in some cases.
>
> I am not sure, though, optimizer can take this into account properly -
> to know the number of matched rows before choosing an index
> would mean doing fulltext search for EXPLAIN too - I doubt it will be
> appreciated :)
>
> Still, with 2-level index some estimations can be made...
> Great - thanks for the idea!
>
> Anyway, in boolean mode there's no "initialization" so there's no
> reasons (besides historical) for it to be "preferred" - it'll be
fixed.
>
> > To the developers: any word on if and when any of these things would
be
> > implemented? I know from the TODO and other list messages that some
> > will. Any *estimates* (in months or MySQL version) on when would be
> > great. Just any info on new full-text features, even ones that I
didn't
> > mention, would be awesome to hear. :-) And like how they would be
> > implemented and used by us (syntax, etc.).
>
> As I told - it's very difficult to predict this :(
> Anyway, I doubt anything that requires changing .frm
> file structure will get into 4.1
>
> > How about changing the default min/max (or just min if you want)
word
> > length? I think everyone *really* wishes ft_min_word_len was 3.
Seems
> > like that and indexing numbers shorter than min_word_len could be
easily
> > done. Please? :-)
>
> Yes, it's safe enough for 4.1

Sorry, I don't know what this means. :-) You mean ft_min_word_len will
be 3 by default in 4.1? And what is "safe enough?"


> > There Sergei is talking about a new .frm format (plain text) that
will
> > allow more of these features. Will it allow us to somehow define how
to
> > parse things or something?? Could you elaborate more on what this
will
> > bring? In November 2001, he said the new .frm format would be here
"this
> > year." It's been almost 2 years since then, so when is it do?
>
> It's now planned for 5.1 - plain text .frm comes together with
complete
> redesign of internal table structure handling, table structure cache,
> etc.
>
> But even without it .frm format was extended in 4.1 so I don't need it
> for adding per-index options anymore.

That's good news! I forgot that it was extended in 4.1 for per-column
char sets, etc. I don't know what sort of per-index options you can
implement with 4.1 before new 5.1 .frm format (few, most?), but it'd be
wonderful to see some soon. ;^)


> > Also, are the current MySQL versions using the "2 level" full-text
index
> > format yet? I'm thinking not?
>
> 4.1.0 is using it.
> This index structure was done to make possible new powerful
> optimizations. It is these optimizations what is not implemented yet
:(
> It's in my highest-priority todo.
>
> > Finally, in the full-text TODO, it says "Generic user-suppliable UDF
> > preparser." Could you also elaborate on this? The "generic" part
almost
> > makes it sound like some sort of "script" to define how to parse the
> > text. But UDF makes it sound like a separate thing that has to be
loaded
> > with CREATE FUNCTION. But UDFs won't work with your MySQL binaries,
will
> > they, since they're complied statically?
>
> mysql-max binary is compiled dynamically - so it works with UDF's.
> And "UDF" in the todo item does not mean it will MySQL "User-Definable
> Function" yet - it could be a Stored Procedure, e.g.
>
> The idea is to be able to supply a function (whatever it is) that
takes
> a column's data and returns a list of words that this "data" contain.
> It could be used e.g. to fulltext-index pdf's or xml's or MS Word
files,
> or whatever.

Oh, I see. That sounds very useful!


> Regards,
> Sergei

Thanks for reading. And thanks all your full-text work Sergei!


Matt

P.S. Is there a document somewhere that has information about the
internals of full-text search or MySQL in general? I noticed this "bk
commit - mysqldoc tree (1.790)" message on the Internals list the other
day:

http://lists.mysql.com/list.php?3:mss:9961:200309:bpfbpgphemknogaidjep

and bk commit - mysqldoc tree (1.799)

http://lists.mysql.com/list.php?3:mss:9996:200309:lejkmpinlmdninacgcpl

They appear to be updating a document about internals (inc. full-text
search). However, I can't find this document anywhere (source (for
Win32), MySQL site/docs). Could you tell me where to get it? :-) Thanks!


-- 
MySQL General Mailing List
For list archives: http://lists.mysql.com/mysql
To unsubscribe:    http://lists.mysql.com/[EMAIL PROTECTED]

Re: Lots of FULLTEXT stuff (suggestions)

Reply via email to