>
> On Mon, 7 Jan 2002, Arne Mueller wrote:
> > I wonder whether one can use the full text indexes in mysql to find out
> > what words in a document are likely to be relevant key words.
> >
> > . . .
> >
> > It'd be nice to have a command like this:
> >
> > select keywords(10.0) from MyDocs where DocId = 666;
> >
> > . . .
>
> Isn't function MATCH what you want? Example from the manual:
>
> mysql> SELECT *,MATCH a,b AGAINST ('collections support') as x FROM t;
> +------------------------------+-------------------------------+--------+
> | a                            | b                             | x      |
> +------------------------------+-------------------------------+--------+
> | MySQL has now support        | for full-text search          | 0.3834 |
> | Full-text indexes            | are called collections        | 0.3834 |
> | Only MyISAM tables           | support collections           | 0.7668 |
> | Function MATCH ... AGAINST() | is used to do a search        |      0 |
> | Full-text search in MySQL    | implements vector space model |      0 |
> +------------------------------+-------------------------------+--------+
> 5 rows in set (0.00 sec)
>
> The function MATCH matches a natural language query AGAINST a text
> collection (which is simply the columns that are covered by a FULLTEXT
> index). For every row in a table it returns relevance - a similarity
> measure between the text in that row (in the columns that are part of the
> collection) and the query. When it is used in a WHERE clause (see example
> above) the rows returned are automatically sorted with relevance
> decreasing.

The problem is that I don't know what to put in the AGAINST part. Given a
document, I'd like to know what it is about without reading it. To extract
the most relevant keywords from a single document with MATCH ... AGAINST,
I'd have to do something like this:

    foreach word in Document with DocId = N, do:
        SELECT MATCH text_column AGAINST (word) FROM table WHERE DocId = N;
        if relevance of match > 0.5, do:
            remember this word as a relevant keyword
    print all keywords for Document with DocId = N

But I guess this is far too slow, since it runs one query per word.
Basically I have to implement this myself using a table for the documents,
a table for the words, and a table that links a document with its words.

Each word in the word table will have a counter that tells me how often
this word occurs in all documents of the document table, and the linker
table (that links the docs with the words) contains a counter column for
the frequency of a word in that particular document. From this I can
extract the most relevant words for each document. The overall frequencies
in the word table have to be updated every time a new document is inserted,
but this is ok for me.

The relevance of the word 'mysql' in document X could be calculated as the
frequency of 'mysql' in X (stored in the linker table) divided by the
frequency of 'mysql' in all documents (stored in the word table). The
closer this number is to 1, the better the score ...
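
To make the three-table idea concrete, here is a minimal sketch of it. It is
only an illustration under my own assumptions: it uses Python with an
in-memory SQLite database instead of MySQL so that it runs on its own, and
the table names, column names, and the simple letters-only tokenizer are all
hypothetical.

```python
import re
import sqlite3
from collections import Counter


def build_index(docs):
    """Load {doc_id: text} into documents, words and doc_words tables.

    All table and column names here are hypothetical.
    """
    db = sqlite3.connect(":memory:")
    db.executescript("""
        CREATE TABLE documents (doc_id INTEGER PRIMARY KEY, body TEXT);
        CREATE TABLE words (word TEXT PRIMARY KEY,
                            total_count INTEGER NOT NULL);
        CREATE TABLE doc_words (doc_id INTEGER NOT NULL,
                                word TEXT NOT NULL,
                                doc_count INTEGER NOT NULL,
                                PRIMARY KEY (doc_id, word));
    """)
    for doc_id, text in docs.items():
        db.execute("INSERT INTO documents VALUES (?, ?)", (doc_id, text))
        # Assumed tokenizer: lowercase runs of ASCII letters.
        counts = Counter(re.findall(r"[a-z]+", text.lower()))
        for word, n in counts.items():
            # Linker table: frequency of the word in this document.
            db.execute("INSERT INTO doc_words VALUES (?, ?, ?)",
                       (doc_id, word, n))
            # Word table: overall frequency, updated on every insert.
            db.execute("INSERT OR IGNORE INTO words VALUES (?, 0)", (word,))
            db.execute("UPDATE words SET total_count = total_count + ? "
                       "WHERE word = ?", (n, word))
    return db


def keywords(db, doc_id, limit=10):
    """Return (word, score) pairs for one document, best score first.

    score = frequency in the document / frequency in all documents.
    """
    return db.execute("""
        SELECT dw.word, 1.0 * dw.doc_count / w.total_count AS score
        FROM doc_words AS dw JOIN words AS w ON w.word = dw.word
        WHERE dw.doc_id = ?
        ORDER BY score DESC, dw.doc_count DESC, dw.word
        LIMIT ?
    """, (doc_id, limit)).fetchall()


if __name__ == "__main__":
    db = build_index({1: "mysql mysql index the", 2: "the the table"})
    for word, score in keywords(db, 1):
        print(word, round(score, 3))
```

One thing the sketch shows: a word that occurs in only one document always
scores 1, even if it appears there just once, so a minimum count per
document would probably be needed on top of the plain ratio.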

Has anyone here implemented such a text mining database? I'd be
interested in your solutions and experience.

Thanks a lot for any comments,

Arne
--
Arne Mueller
Biomolecular Modelling Laboratory
Imperial Cancer Research Fund
44 Lincoln's Inn Fields
London WC2A 3PX, U.K.
phone : +44-(0)207 2693405 | fax :+44-(0)207-269-3534
email : [EMAIL PROTECTED] | http://www.bmm.icnet.uk