At 7:52 PM +0100 3/23/01, Jochen Eisinger wrote:
>The "base" approach is the following:
> - all words of a document are taken
> - words in a "stop list" of general words are ignored
> - the roots of the words are determined (similar to htfuzzy word2root)
>
>Now, I'm looking for ways to improve this (i.e. to reduce the size of the
>list without losing much information).
I would generally take a look at word frequencies from the resulting
list and toss out very frequent ones since they give very little
information. For example, on the htdig.org site (counting mailing
list archives), the words "thread," "subject," "message," etc. are
all too common to be useful, yet they may not appear in a stop list
built from a priori knowledge alone.
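The frequency-based pruning suggested above could be sketched like this. The function name, the document-frequency threshold of 50%, and the sample documents are all assumptions for illustration; the idea is simply to drop words that occur in so many documents that they carry little information.

```python
from collections import Counter

# Hypothetical sketch: after stop-list filtering and stemming, drop
# words that appear in more than some fraction of documents (an
# assumed threshold of 50% here).

def prune_frequent_words(docs, stop_words, max_doc_fraction=0.5):
    """docs: list of lists of (already stemmed) words, one per document."""
    doc_freq = Counter()
    for words in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(words) - stop_words)
    cutoff = max_doc_fraction * len(docs)
    return {w for w in doc_freq if doc_freq[w] <= cutoff}

docs = [
    ["thread", "subject", "htdig", "index"],
    ["thread", "subject", "search", "fuzzy"],
    ["thread", "message", "config"],
]
keep = prune_frequent_words(docs, stop_words={"message"})
# "thread" (3 of 3 docs) and "subject" (2 of 3 docs) exceed the
# 50% cutoff and are discarded; rarer words are kept.
```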
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html