According to Thomas von Eyben:
> Does anyone know if there exists a similar list of Danish words like the
> ones that are referred to here:
> http://www.htdig.org/FAQ.html#q4.6
>
> It could save me a lot of hours (and disk space :-)
The only other contributed bad word lists we've received have been
English ones. If anyone does send you a Danish one, we wouldn't mind
having a copy for the Contributed Work section of the htdig.org site.
If you need to make your own bad word list, here's a little trick for
quickly finding the words that appear most frequently on your site:
tr ':' '\011' < db.wordlist | \
awk '$8 == "c" { count[$1] += $9 }
     $8 != "c" { count[$1]++ }
     END { for (i in count) print count[i], i }' | \
sort -k1,1nr | more
This works with the 3.1.x versions of htdig, but not with the 3.2 betas,
which don't have a db.wordlist file. It will take a while to process, as
awk is pretty slow, but it beats counting by hand. You'll need to sift
through the words to pick out which are OK to exclude and which aren't.
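
If you expect to run this more than once, the same counting can be
wrapped in a shell function. This is only a sketch: it assumes the
3.1.x db.wordlist layout implied by the awk fields above (the word
first, then whitespace-separated colon-tagged fields, with an optional
c: count in the fifth position). The name top_words and the default of
50 lines are my own inventions, not anything htdig provides.

```shell
# top_words FILE [N]: print the N most frequent words in FILE, most
# frequent first. Hypothetical helper, not part of htdig.
top_words() {
    # Assumed line layout: word i:docid l:location w:weight [c:count]
    tr ':' '\011' < "$1" |
    awk '
        $8 == "c" { count[$1] += $9; next }   # entry carries an explicit count
                  { count[$1]++ }             # otherwise it counts once
        END       { for (w in count) print count[w], w }
    ' |
    sort -k1,1nr -k2,2 |      # by count descending, ties broken by word
    head -n "${2:-50}"        # N defaults to 50 when not given
}
```

Something like "top_words db.wordlist 100 > candidates.txt" then gives
you a file you can go through in an editor rather than in a pager.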
E.g., on my site, words like "spinal", "cord" and "research" are in the
top ten, which isn't altogether surprising, but I still wouldn't put them
in my bad words list. However, words like "during", "were", "which",
"these" and "also" are in the top 20, and are good candidates for my
bad words list, if I felt inclined to put one together for my site.
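
Part of that sifting can be mechanical. As a rough sketch, assuming you
keep a file of words you never want excluded (one per line), the
function below reads "count word" lines and prints only the words that
are not in that file. The name candidate_bad_words and the keep-file
are hypothetical, and the output still needs a human look before it
goes anywhere near a real bad word list.

```shell
# candidate_bad_words KEEP_FILE: read "count word" lines on stdin and
# print the words that are NOT listed (one per line) in KEEP_FILE.
# Hypothetical helper, not part of htdig.
candidate_bad_words() {
    awk '{ print $2 }' |      # drop the count, keep the word
    grep -vxF -f "$1"         # -F fixed strings, -x whole line, -v invert
}
```

With "spinal", "cord" and "research" in the keep-file, those would drop
out of the list while "during", "were" and the like would remain.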
--
Gilles R. Detillieux E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930