At 7:52 PM +0100 3/23/01, Jochen Eisinger wrote:
>The "base" approach is the following:
> - all words of a document are taken
> - words in a "stop list" of general words are ignored
> - the roots of the words are determined (similar to htfuzzy word2root)
>
>Now, I'm looking for ways to improve this (i.e. to reduce the size of the
>list without losing much information).
I would generally take a look at word frequencies from the resulting
list and toss out very frequent ones since they give very little
information. For example, on the htdig.org site (counting mailing
list archives), the words "thread," "subject," "message," etc. are
all too common to be useful, yet they may not appear in a stop list
built from a priori knowledge alone.
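The frequency-based pruning suggested above could be sketched like this. The function name, the document-frequency threshold of 50%, and the sample documents are all assumptions for illustration; the idea is simply to drop words that occur in so many documents that they carry little information.

```python
from collections import Counter

# Hypothetical sketch: after stop-list filtering and stemming, drop
# words that appear in more than some fraction of documents (an
# assumed threshold of 50% here).

def prune_frequent_words(docs, stop_words, max_doc_fraction=0.5):
    """docs: list of lists of (already stemmed) words, one per document."""
    doc_freq = Counter()
    for words in docs:
        # Count each word once per document (document frequency).
        doc_freq.update(set(words) - stop_words)
    cutoff = max_doc_fraction * len(docs)
    return {w for w in doc_freq if doc_freq[w] <= cutoff}

docs = [
    ["thread", "subject", "htdig", "index"],
    ["thread", "subject", "search", "fuzzy"],
    ["thread", "message", "config"],
]
keep = prune_frequent_words(docs, stop_words={"message"})
# "thread" (3 of 3 docs) and "subject" (2 of 3 docs) exceed the
# 50% cutoff and are discarded; rarer words are kept.
```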
--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html