Re: [CODE4LIB] MySQL Stop Words

2009-06-01 Thread Mike Taylor
However, all of these oddities -- overeager stop list, ignoring short
words, not counting words in more than half the rows -- can be sorted
out by configuration options.  I'm sorry I don't have them to hand,
but half an hour or so on Google should give you the information you
need to make MySQL act like a half-decent text-search engine.

 _/|____
/o ) \/  Mike Taylor    m...@indexdata.com    http://www.miketaylor.org.uk
)_v__/\  The Pope claimed he'd been wrong in the past; this was a big
 surprise -- Sting, Jeremiah Blues





Re: [CODE4LIB] MySQL Stop Words

2009-06-01 Thread Casey Bisson
The minimum word length and stop word list can be set in the server
config file, with no recompile needed. The exclusion of words that are
in more than 50% of the corpus is a compile-time issue (or simply use
boolean mode). Here are the settings to be aware of:


ft_min_word_len=3
ft_stopword_file=/dev/null
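
A minimal sketch of how one might check these settings from the mysql
client, assuming MySQL 5.1 and a MyISAM table with a FULLTEXT index. The
table name `rawtext` is borrowed from elsewhere in this thread purely
for illustration. Both variables are read at server startup, so after
editing the config file the server has to be restarted, and existing
FULLTEXT indexes have to be rebuilt before the new values affect
searches.

```sql
-- Show the full-text settings the running server is actually using
-- (ft_min_word_len, ft_stopword_file, and related variables).
SHOW VARIABLES LIKE 'ft_%';

-- After changing the settings and restarting the server, rebuild the
-- FULLTEXT index. `rawtext` is an assumed table name; repeat for each
-- MyISAM table that has a FULLTEXT index.
REPAIR TABLE `rawtext` QUICK;
```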

--Casey

http://about.scriblio.net/
http://maisonbisson.com/


On Jun 1, 2009, at 11:13 AM, Mike Taylor wrote:


However, all of these oddities -- over eager stop-list, ignoring short
words, not counting words in more than half the rows -- can be sorted
out by configuration options.


Re: [CODE4LIB] MySQL Stop Words

2009-05-29 Thread Genny Engel
Once I read a study where the document collection to be indexed was in a narrow 
technical field, and the goal was to present a search that quickly isolated 
ONLY the most relevant documents.  To this end, they stopworded everything that 
didn't sufficiently distinguish one document from another.   Their stopword 
list comprised some 30,000 terms!
 
If your goal, on the other hand, is to maximize recall at some expense of 
precision, beware of MySQL full-text MATCH because it dynamically computes new 
stopwords.  Note this little side note in section 11.8.1 of the manual:
 
For very small tables, word distribution does not adequately reflect their 
semantic value, and this model may sometimes produce bizarre results. For 
example, although the word MySQL is present in every row of the articles 
table shown earlier, a search for the word produces no results [ ... ] The 
search result is empty because the word MySQL is present in at least 50% of 
the rows. As such, it is effectively treated as a stopword. For large data 
sets, this is the most desirable behavior: A natural language query should not 
return every second row from a 1GB table. For small data sets, it may be less 
desirable. 
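
To illustrate the behavior described in that passage, here is a hedged
sketch that assumes the `articles` table from the manual's tutorial,
with a FULLTEXT index on (`title`, `body`). The first query uses the
default natural language mode, where the 50% threshold applies. The
second uses boolean mode, which does not apply that threshold.

```sql
-- Natural language mode (the default). On a small table where 'MySQL'
-- appears in half the rows or more, the manual reports an empty result.
SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('MySQL');

-- Boolean mode: the 50% threshold is not applied, so rows containing
-- the word can be returned even when it is very common.
SELECT * FROM articles
WHERE MATCH (title, body) AGAINST ('MySQL' IN BOOLEAN MODE);
```

Boolean mode is still subject to the stop word list and the minimum
word length, so it only sidesteps the threshold problem.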
 
 
 
 
Genny Engel
Sonoma County Library
gen...@sonoma.lib.ca.us
707 545-0831 x581
www.sonomalibrary.org
 


On 05/29/09 at 11:26AM, dclout...@co.marin.ca.us wrote:
In building a search function for some of our internal documents in PHP
/ MySQL, I took a look at the default list of MySQL English language
stop words used in the natural language searching feature. The list is
actually quite extensive, and goes well beyond the typical list of forms
of "to be", common prepositions, conjunctions, etc. It also includes a
large number of keywords that librarians or academic users might want to
search for. Here are a few examples:

available
appropriate
course
follow
former
novel

There are quite a number of other stop words that I think are suspect.
The full list of stop words is located here:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html 

I guess the point is that if you're building a library application that
takes advantage of MySQL's fulltext searching features, you might want
to customize your stop word list on your MySQL installation if you think
your library users might want to search for the word "novel".

- David

---
David Cloutman dclout...@co.marin.ca.us
Electronic Services Librarian
Marin County Free Library 

Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm 

Re: [CODE4LIB] MySQL Stop Words

2009-05-29 Thread Richmond,Ian
MySQL also skips any words fewer than 4 characters long unless you add a
line to the config file to lower the minimum.



Re: [CODE4LIB] MySQL Stop Words

2009-05-29 Thread Cloutman, David
It seems like there are a number of eccentricities like that. I saw
somewhere in the documentation that if a word appears in more than half
the rows, it isn't searched. Because of that, I'm only using MATCH to
generate relevancy numbers. I'm doing the boolean AND matching with LIKE
on each search term. My queries are like:

SELECT `path`, `mdate`
FROM `rawtext`
WHERE `rawText` LIKE '%book%'
  AND `rawText` LIKE '%lists%'
ORDER BY MATCH (`rawText`) AGAINST ('book lists') DESC, `mdate` DESC

(Yes, I have a table named rawtext with a column named rawText. Shoot
me.)
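
For comparison, a hedged sketch of the same search using boolean mode
for the filter instead of LIKE, assuming the same `rawtext` table with
a FULLTEXT index on `rawText`. The `relevance` alias is a name
introduced here for illustration. This is not a drop-in equivalent:
LIKE '%book%' also matches substrings such as "books" or "notebook",
while the boolean operators match whole indexed words.

```sql
SELECT `path`, `mdate`,
       -- Natural language score, used only for ordering.
       MATCH (`rawText`) AGAINST ('book lists') AS relevance
FROM `rawtext`
-- '+' marks each word as required; boolean mode skips the 50% threshold.
WHERE MATCH (`rawText`) AGAINST ('+book +lists' IN BOOLEAN MODE)
ORDER BY relevance DESC, `mdate` DESC;
```

The filter can use the FULLTEXT index rather than scanning every row,
though on a few hundred documents the difference is unlikely to matter.
The stop word list and minimum word length still apply to both MATCH
expressions.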

It's far from perfect, but I just need to index a few hundred documents,
and it's for staff consumption. I would be curious to know how Postgres
compares to MySQL in this area. I'm looking towards finding a long-term
alternative to MySQL due to the Sun / Oracle merger, which I don't think
will end well for MySQL.

- David

---
David Cloutman dclout...@co.marin.ca.us
Electronic Services Librarian
Marin County Free Library 
