Re: [htdig] Bad words file question

Gilles Detillieux Wed, 31 Oct 2001 13:43:37 -0800

According to Ace:
> I suppose the bad words list contains the most often used words of the 
> language. Is it imaginable that htdig indexes all files to be indexed 
> and finds out the most often used words and prints them out, so I could 
> decide which words I want to exclude from the index to speed up searching?


htdig doesn't do this directly, but it could be done pretty easily by
analysing the db.wordlist file in 3.1.x, or running htdump and analysing
the db.worddump file in 3.2.x.  Either way, you could write a simple
awk or Perl script that would total up the word counts.
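
As a rough illustration of that kind of tally, here is a minimal sketch in Python rather than awk or Perl. It assumes the dump is a text file with one record per line, with the word as the first whitespace-separated field. That layout is an assumption, so check it against your own db.wordlist or db.worddump before relying on the numbers. The names top_words and report are made up for this example, and the latin-1 encoding is a guess as well.

```python
from collections import Counter


def top_words(lines, limit=100):
    """Return the `limit` most frequent words as (word, count) pairs.

    Assumption: the word is the first whitespace-separated field of
    each record, and every record counts as one occurrence.
    """
    totals = Counter()
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        totals[fields[0].lower()] += 1
    return totals.most_common(limit)


def report(path, limit=100):
    """Print 'count<TAB>word' for the most frequent words in a dump file."""
    # latin-1 is assumed here; adjust it to match your documents.
    with open(path, encoding="latin-1") as dump:
        for word, count in top_words(dump, limit):
            print(f"{count}\t{word}")
```

Calling report("db.wordlist") would print the candidates, most frequent first, for you to review by hand before adding any of them to the bad words file. If your dump carries an explicit per-record count field, summing that field instead of counting records would be the more accurate variant.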

> Would it help if I told you that the university of Leipzig has published 
> word lists containing the 100, 1000 and 10000 most often used words of 
> english, german, french and dutch at 
> http://woclu2.informatik.uni-leipzig.de/html/wliste.html - no copyrights 
> and no restrictions seem to be applied to the downloadable files?

Danke schoen!  I've added this tip to FAQ 4.6.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
