Hi Gilles,
On Wed, Jan 23, 2002 at 04:18:53PM -0600, Gilles Detillieux wrote:
> If you need to make your own bad word list, here's a little trick for
> quickly determining which words appear most frequently on your site:
>
> tr ':' '\011' < db.wordlist | \
> awk '$8 == "c" { count[$1] += $9 }; $8 != "c" { count[$1]++ };
>      END { for (i in count) print count[i], i }' | \
> sort +0nr | more
>
> This will work for 3.1.x versions of htdig, but not the 3.2 betas, which
> don't have a db.wordlist file. It'll take a while to process, as awk is
> pretty slow, but it beats counting by hand. You'll need to sift through
> the words to pick out which are OK to exclude and which aren't.
[...]
> -----------------
>
> To adapt this to your specific problem, you could add
>
> $1 ~ /^..$/ &&
>
> to the start of each pattern in the awk script above to limit it to
> two-letter words, which should help speed up the process.
>
> You may also find that indexing all 2-letter words isn't that big a
> problem after all. It will certainly make your wordlist and word db
> bigger, but on a reasonably fast system with adequate disk, that may
> not hurt at all.
Great suggestions! I'll give this a shot and see what surfaces.
You might be right... just leaving all the 2-letter words in the
index may actually be no big problem.
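For anyone else following along, here's a sketch of the combined command.
It assumes the colon-separated db.wordlist layout your script implies
(word in field 1, a record-type flag in field 8, and a count in field 9
for "c" records); the sample file below is synthetic, just to show the
shape, and `sort -k1,1nr` is the modern spelling of `sort +0nr`:

```shell
# Synthetic sample in the field layout the 3.1.x script implies:
# colon-separated records, word in field 1; field 8 is "c" for
# pre-counted records (count in field 9), anything else otherwise.
cat > db.wordlist.sample <<'EOF'
of:0:0:0:0:0:0:c:12
of:0:0:0:0:0:0:x:0
to:0:0:0:0:0:0:x:0
hypertext:0:0:0:0:0:0:c:3
EOF

# Same pipeline, restricted to two-letter words.  The guard has to
# apply to BOTH rules (done here with a single "next" rule up front),
# or plain records for longer words would still be counted.
tr ':' '\011' < db.wordlist.sample | \
awk '$1 !~ /^..$/ { next }
     $8 == "c"    { count[$1] += $9 }
     $8 != "c"    { count[$1]++ }
     END { for (w in count) print count[w], w }' | \
sort -k1,1nr
# prints:
# 13 of
# 1 to
```

Swap db.wordlist.sample for the real db.wordlist (and pipe into more
or less) to use it on a live index.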
Thanks again,
--
Patrick Robinson
AHNR Info Technology, Virginia Tech
[EMAIL PROTECTED]
_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html