OK, thanks.  In addition to "common words", closer inspection
of my db.wordlist reveals some potential issues, such as:

- there are about 64 million words in it.  I would think that
might contribute to performance problems!

- Most of them are garbage, 'zzzzzzzzzz' for example.

Is there any way to remove garbage from that file?  Since
it is just a text file, can I simply write a perl script to remove
the unwanted words?
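For what it's worth, since the wordlist is one word per line, a plain shell filter may be enough before reaching for perl. A minimal sketch (not htdig-specific; the sample file and the "repeated single character" junk pattern are my own assumptions, so adjust the regex for your data and work on a copy first):

```shell
# Build a tiny sample wordlist to demonstrate on (assumed format:
# word followed by other fields, one entry per line).
printf 'hello 1\nzzzzzzzzzz 2\nworld 3\n' > db.wordlist.sample

# Drop lines whose word is a single character repeated four or more
# times, e.g. 'zzzzzzzzzz'. Uses a BRE back-reference, which plain
# grep supports portably.
grep -v '^\(.\)\1\{3,\}' db.wordlist.sample > db.wordlist.clean

cat db.wordlist.clean
```

After filtering, `db.wordlist.clean` keeps only the `hello` and `world` entries. The same back-reference idea extends to other junk patterns (digits-only words, very long words), but run any filter against a copy and re-run htmerge afterwards rather than editing the live database.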

Thanks,

Tim

Geoff Hutchison wrote:
> 
> On Fri, 16 Mar 2001, Peterman, Timothy P wrote:
> 
> > have "backlink_factor" set to 0 (zero).  I could add some of
> > the common words to the "bad_word_list", but it's hard to
> > predict what they may be.  Is there anything else I may have
> 
> It's actually pretty easy to figure out common words. Try something like:
> awk '{print $1}' db.wordlist | uniq -c | sort -rn | head -50
> For example, on htdig.org:
>   11424 date
>   11423 htdig
>   11343 archive
>   11334 subject
>   11332 author
>   11330 message
>   11318 thread
>   11001 list
>   10738 generated
>   10696 sorted
> 
> > return a more informative message than "No matches found"
> 
> Sure. You can edit the nomatch.html page:
> <http://www.htdig.org/attrs.html#nothing_found_file>
> 
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/

-- 
Tim Peterman - Web Master,
IT&P Unix Support Group Technical Lead
Lockheed Martin EIS/NE&SS, Moorestown, NJ

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html