OK, thanks. In addition to "common words", closer inspection
of my db.wordlist reveals some potential issues, such as:
- there are about 64 million words in it. I would think that
might contribute to performance problems!
- most of them are garbage ('zzzzzzzzzz', for example).
Is there any way to remove garbage from that file? Since
it is just a text file, can I simply write a Perl script to
remove the unwanted words?
Thanks,
Tim
Geoff Hutchison wrote:
>
> On Fri, 16 Mar 2001, Peterman, Timothy P wrote:
>
> > have "backlink_factor" set to 0 (zero). I could add some of
> > the common words to the "bad_word_list", but it's hard to
> > predict what they may be. Is there anything else I may have
>
> It's actually pretty easy to figure out common words. Try something like:
> awk '{print $1}' db.wordlist | uniq -c | sort -rn | head -50
> For example, on htdig.org:
> 11424 date
> 11423 htdig
> 11343 archive
> 11334 subject
> 11332 author
> 11330 message
> 11318 thread
> 11001 list
> 10738 generated
> 10696 sorted
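[The same pipeline can feed the `bad_word_list` file directly: strip the counts and append the words. The output path below is an assumption; use whatever file your `bad_word_list` attribute points at.]

```shell
# Take the 50 most common words and append them (without the counts)
# to the bad-words file; adjust the path for your installation.
awk '{print $1}' db.wordlist | uniq -c | sort -rn | head -50 \
    | awk '{print $2}' >> /opt/htdig/common/bad_words
```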
>
> > return a more informative message than "No matches found"
>
> Sure. You can edit the nomatch.html page:
> <http://www.htdig.org/attrs.html#nothing_found_file>
>
> --
> -Geoff Hutchison
> Williams Students Online
> http://wso.williams.edu/
--
Tim Peterman - Web Master,
IT&P Unix Support Group Technical Lead
Lockheed Martin EIS/NE&SS, Moorestown, NJ
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html