I once read a study in which the document collection to be indexed was in a 
narrow technical field, and the goal was to present a search that quickly 
isolated ONLY the most relevant documents. To this end, the authors stopworded 
every term that didn't sufficiently distinguish one document from another. 
Their stopword list comprised some 30,000 terms!
 
If your goal, on the other hand, is to maximize recall at some expense of 
precision, beware of MySQL full-text MATCH in natural language mode, because it 
effectively creates new stopwords on the fly from your own data. See this note 
in section 11.8.1 of the manual:
 
For very small tables, word distribution does not adequately reflect their 
semantic value, and this model may sometimes produce bizarre results. For 
example, although the word "MySQL" is present in every row of the articles 
table shown earlier, a search for the word produces no results [ ... ] The 
search result is empty because the word "MySQL" is present in at least 50% of 
the rows. As such, it is effectively treated as a stopword. For large data 
sets, this is the most desirable behavior: A natural language query should not 
return every second row from a 1GB table. For small data sets, it may be less 
desirable. 
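
To make the 50% rule concrete, here is a minimal sketch in Python. It is only a simplified model of the behavior the manual describes, not MySQL's actual ranking code, and the sample rows, the function names, and the whitespace tokenizing are all my own assumptions.

```python
from collections import Counter

def implicit_stopwords(rows, threshold=0.5):
    """Return the words present in at least `threshold` of the rows."""
    doc_freq = Counter()
    for text in rows:
        # Count each word once per row: document frequency, not term frequency.
        doc_freq.update(set(text.lower().split()))
    cutoff = threshold * len(rows)
    return {word for word, n in doc_freq.items() if n >= cutoff}

def search(rows, word, threshold=0.5):
    """Return rows containing `word`, or nothing if the word is too common."""
    word = word.lower()
    if word in implicit_stopwords(rows, threshold):
        return []  # treated as a stopword, so the search comes back empty
    return [text for text in rows if word in text.lower().split()]

# Hypothetical rows, not the manual's actual articles table.
rows = [
    "MySQL tutorial",
    "MySQL security notes",
    "optimizing MySQL queries",
    "MySQL versus another database",
]

print(search(rows, "mysql"))     # []
print(search(rows, "security"))  # ['MySQL security notes']
```

In this toy table "mysql" is in every row, so a search for it returns nothing, while "security" is in one row out of four and is found as expected.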
 
 
 
 
Genny Engel
Sonoma County Library
gen...@sonoma.lib.ca.us
707 545-0831 x581
www.sonomalibrary.org
 


>>> dclout...@co.marin.ca.us 05/29/09 11:26AM >>>
In building a search function for some of our internal documents in PHP
/ MySQL, I took a look at the default list of MySQL English language
stop words used in the natural language searching feature. The list is
actually quite extensive, and goes well beyond the typical list of forms
of "to be", common prepositions, conjunctions, etc. It also includes a
large number of keywords that librarians or academic users might want to
search for. Here are a few examples:

available
appropriate
course
follow
former
novel

There are quite a number of other stop words that I think are suspect.
The full list of stop words is located here:
http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html 

I guess the point is that if you're building a library application that
takes advantage of MySQL's fulltext searching features, you might want
to customize your stop word list on your MySQL installation if you think
your library users might want to search the word "novel".
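
One rough way to see which patron queries would be affected is to run sample queries against the stop word list before deciding what to remove. The Python sketch below assumes a simple whitespace split and uses only the six example words from this message, not the full MySQL list; the function name and the sample query are made up for illustration.

```python
# Excerpt only: the six examples quoted in this message, not the full MySQL list.
SAMPLE_STOPWORDS = {"available", "appropriate", "course", "follow", "former", "novel"}

def split_query(query, stopwords=SAMPLE_STOPWORDS):
    """Split a query into (kept, dropped) word lists, preserving word order."""
    kept, dropped = [], []
    for word in query.lower().split():
        # A word found in the stop word list is silently ignored by the search.
        (dropped if word in stopwords else kept).append(word)
    return kept, dropped

kept, dropped = split_query("graphic novel course reserves")
print(kept)     # ['graphic', 'reserves']
print(dropped)  # ['novel', 'course']
```

Any term that lands in the dropped list for a realistic query is a candidate for removal from a customized list.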

- David

---
David Cloutman <dclout...@co.marin.ca.us>
Electronic Services Librarian
Marin County Free Library 

