Once I read a study where the document collection to be indexed was in a narrow technical field, and the goal was to present a search that quickly isolated ONLY the most relevant documents. To this end, they stopworded everything that didn't sufficiently distinguish one document from another. Their stopword list comprised some 30,000 terms!

If your goal, on the other hand, is to maximize recall at some expense of precision, beware of MySQL full-text MATCH because it dynamically computes new stopwords. Note this little side note in section 11.8.1 of the manual:

    For very small tables, word distribution does not adequately reflect their semantic value, and this model may sometimes produce bizarre results. For example, although the word "MySQL" is present in every row of the articles table shown earlier, a search for the word produces no results [ ... ]

    The search result is empty because the word "MySQL" is present in at least 50% of the rows. As such, it is effectively treated as a stopword. For large data sets, this is the most desirable behavior: A natural language query should not return every second row from a 1GB table. For small data sets, it may be less desirable.

Genny Engel
Sonoma County Library
gen...@sonoma.lib.ca.us
707 545-0831 x581
www.sonomalibrary.org
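
To make the 50% rule quoted above concrete, here is a rough Python sketch of the idea. It is only an illustration under simplifying assumptions, not MySQL's actual ranking code: the function names, the whitespace tokenization, and the sample rows are all made up for this example.

```python
from collections import Counter


def effective_stopwords(rows, threshold=0.5):
    """Return the terms that occur in at least `threshold` of the rows.

    Simplified model of the behavior described in the manual; MySQL's
    real natural-language ranking is more involved than this.
    """
    doc_freq = Counter()
    for text in rows:
        # Count each term once per row (document frequency, not term frequency).
        doc_freq.update(set(text.lower().split()))
    return {term for term, n in doc_freq.items() if n / len(rows) >= threshold}


def search(rows, term, threshold=0.5):
    """Return the rows containing `term`, or nothing if the term is too common."""
    term = term.lower()
    if term in effective_stopwords(rows, threshold):
        return []  # treated as a stopword, so the search comes back empty
    return [row for row in rows if term in row.lower().split()]


# Hypothetical four-row table in which every row mentions "mysql".
rows = [
    "mysql tutorial",
    "mysql tuning tips",
    "optimizing mysql",
    "backup strategies for mysql",
]

print(search(rows, "mysql"))   # too common: no rows returned
print(search(rows, "tuning"))  # rare enough: the matching row is returned
```

On a table this small, the term that describes the whole collection is exactly the one that disappears, which is the "bizarre result" the manual warns about.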
>>> dclout...@co.marin.ca.us 05/29/09 11:26AM >>>

In building a search function for some of our internal documents in PHP / MySQL, I took a look at the default list of MySQL English language stop words used in the natural language searching feature. The list is actually quite extensive, and goes well beyond the typical list of "to be" cognates, common prepositions, conjunctions, etc. It also includes a large number of keywords that librarians or academic users might want to search for. Here are a few examples:

available
appropriate
course
follow
former
novel

There are quite a number of other stop words that I think are suspect. The full list of stop words is located here:

http://dev.mysql.com/doc/refman/5.1/en/fulltext-stopwords.html

I guess the point is that if you're building a library application that takes advantage of MySQL's fulltext searching features, you might want to customize your stop words list on your MySQL installation if you think your library users might want to search the word "novel".

- David

---
David Cloutman <dclout...@co.marin.ca.us>
Electronic Services Librarian
Marin County Free Library

Email Disclaimer: http://www.co.marin.ca.us/nav/misc/EmailDisclaimer.cfm
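
One way an application could warn users before a stop word silently vanishes from their query is sketched below in Python. This is a hedged illustration only: SAMPLE_STOPWORDS holds just the six examples quoted in this thread rather than the full default list, dropped_terms is a hypothetical helper, and the regular-expression tokenizer is an assumption that does not reproduce MySQL's own parser.

```python
import re

# Only the six examples from this thread; the real default list is much longer.
SAMPLE_STOPWORDS = {"available", "appropriate", "course", "follow", "former", "novel"}


def dropped_terms(query, stopwords=SAMPLE_STOPWORDS):
    """Return the query words that appear in the stop word list, in query order."""
    # Naive tokenization (assumption): lowercase runs of letters and apostrophes.
    words = re.findall(r"[a-z']+", query.lower())
    return [word for word in words if word in stopwords]


print(dropped_terms("Victorian novel course reserves"))
print(dropped_terms("gothic fiction"))
```

A check like this could run in the search form handler, so a patron searching for "novel" gets an explanation rather than an unexplained empty result.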