Hi Gilles,

On Wed, Jan 23, 2002 at 04:18:53PM -0600, Gilles Detillieux wrote:
> If you need to make your own bad word list, here's a little trick for
> quickly determining what are the words that appear most frequently on
> your site:
> 
>    tr ':' '\011' < db.wordlist | \
>      awk '$8 == "c" { count[$1] += $9 }; $8 != "c" { count[$1]++ }; END { for (i in count) print count[i], i }' | \
>      sort +0nr | more
> 
> This will work for 3.1.x versions of htdig, but not 3.2 betas which don't
> have a db.wordlist file.  It'll take a while to process, as awk is pretty
> slow, but it beats counting by hand.  You'll need to sift through the
> words to pick out which are OK to exclude and which aren't.
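For anyone trying this today, here's a self-contained sketch of the same counting logic using the modern `sort -k` key syntax (`sort +0nr` is the old-style spelling of `-k1,1nr`). The db.wordlist field layout is assumed from the pipeline above: colon-separated records where field 1 is the word, and a "c" in field 8 marks a compressed record whose count is in field 9. The sample data below is made up purely for illustration:

```shell
# Fabricated sample db.wordlist lines, only to make this runnable;
# real files come from a 3.1.x htdig index.
printf '%s\n' \
  'the:1:2:3:4:5:6:c:42' \
  'of:1:2:3:4:5:6:x:0' \
  'the:1:2:3:4:5:6:x:0' > db.wordlist

# Turn colons into tabs so awk sees the fields, sum the counts per
# word ("c" records carry a precomputed count in $9, others count 1
# each), then sort numerically descending on the count.
tr ':' '\011' < db.wordlist |
  awk '$8 == "c" { count[$1] += $9 }
       $8 != "c" { count[$1]++ }
       END { for (i in count) print count[i], i }' |
  sort -k1,1nr
```

On the sample data this prints "the" first with a combined count of 43 (42 from the compressed record plus one plain occurrence), then "of" with 1.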
[...]
> -----------------
> 
> To adapt this to your specific problem, you could add
> 
>     $1 ~ /^..$/ &&
> 
> to the start of the awk expression above to limit it to two letter words,
> which should help speed up the process.
> 
> You may also find that indexing all 2-letter words isn't that big a
> problem after all.  It will certainly make your wordlist and word db
> bigger, but on a reasonably fast system with adequate disk, that may
> not hurt at all.
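A concrete version of the modified command might look like the following. Since the awk program has two pattern-action rules, the `$1 ~ /^..$/ &&` test is prepended to both so that non-two-letter words are skipped entirely; the db.wordlist layout and the sample data are the same assumptions as before:

```shell
# Fabricated sample lines, only to make this runnable.
printf '%s\n' \
  'me:1:2:3:4:5:6:c:7' \
  'hello:1:2:3:4:5:6:c:99' \
  'me:1:2:3:4:5:6:x:0' > db.wordlist

# Count only two-letter words: the regex guard is added to the
# start of each awk pattern, as suggested above.
tr ':' '\011' < db.wordlist |
  awk '$1 ~ /^..$/ && $8 == "c" { count[$1] += $9 }
       $1 ~ /^..$/ && $8 != "c" { count[$1]++ }
       END { for (i in count) print count[i], i }' |
  sort -k1,1nr
```

On the sample data "hello" is filtered out and "me" comes through with a count of 8 (7 from the compressed record plus one plain occurrence).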

Great suggestions!  I'll give this a shot and see what surfaces.
You might be right... just leaving all the 2-letter words in the
index may turn out to be no big problem.

Thanks again,

-- 
Patrick Robinson
AHNR Info Technology, Virginia Tech
[EMAIL PROTECTED]

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html