According to Patrick Robinson:
> On Tue, Jan 22, 2002 at 02:17:58PM -0600, Gilles Detillieux wrote:
...
> > You could take the hyphen out of valid_punctuation and put it in
> > extra_word_characters instead, but that may not be what you want in
> > the general case.  Unfortunately, there's no way of singling out
> > specific words for special treatment.
> 
> That's what I was afraid of.  The removal of the hyphen when indexing
> and searching isn't so much of a concern.  I mostly wanted to be able
> to include "4-h" (or "4h") in the db.  I suppose the other "solution"
> would be to reduce minimum_word_length to 2, and then add all the other
> 2 letter words which occur to bad_word_list.  A bit of a maintenance
> headache, but maybe unavoidable in this case.

It need not be that much of a headache.  If you have a bit of disk space
to spare, you can index once with minimum_word_length set to 2, and then
see which two-letter words you need to add to bad_word_list.

Back in December, on the subject of making your own bad_words list, I wrote:
-----------------
If you need to make your own bad word list, here's a little trick for
quickly determining what are the words that appear most frequently on
your site:

   tr ':' '\011' < db.wordlist | \
     awk '$8 == "c" { count[$1] += $9 }
          $8 != "c" { count[$1]++ }
          END { for (i in count) print count[i], i }' | \
     sort -nr | more

This will work for 3.1.x versions of htdig, but not 3.2 betas which don't
have a db.wordlist file.  It'll take a while to process, as awk is pretty
slow, but it beats counting by hand.  You'll need to sift through the
words to pick out which are OK to exclude and which aren't.

E.g., on my site, words like "spinal", "cord" and "research" are in the
top ten, which isn't altogether surprising, but I still wouldn't put them
in my bad words list.  However, words like "during", "were", "which",
"these" and "also" are in the top 20, and are good candidates for my
bad words list, if I felt inclined to put one together for my site.
-----------------

To adapt this to your specific problem, you could add

    $1 ~ /^..$/ &&

to the start of each of the two patterns in the awk expression above
(before both $8 == "c" and $8 != "c") to limit it to two-character words,
which should help speed up the process.
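
Put together, the two-letter variant might look something like the sketch
below.  Treat it as an illustration only: the sample lines are made up, and
their layout (word, then i:, l:, w: and an optional c: count) is an
assumption inferred from the field numbers the pipeline above relies on.
The function name two_letter_counts is hypothetical too.  Instead of
repeating the length test in front of each pattern, this version skips
everything that isn't two characters long with "next", which has the same
effect.

```shell
# Hypothetical sample standing in for db.wordlist.  The field layout is an
# assumption based on the field positions ($8, $9) used in the awk script.
wordlist=$(mktemp)
printf '%s\n' \
  '4h i:1 l:10 w:5 c:3' \
  'of i:1 l:12 w:1 c:7' \
  'of i:2 l:40 w:1' \
  '4h i:2 l:99 w:2' \
  'spinal i:1 l:20 w:4 c:9' > "$wordlist"

# Print "count word" for every two-character word, most frequent first.
two_letter_counts() {
  tr ':' '\011' < "$1" |
    awk '$1 !~ /^..$/ { next }              # ignore anything not 2 chars long
         $8 == "c" { count[$1] += $9 }      # entry carries an explicit count
         $8 != "c" { count[$1]++ }          # no count field: one occurrence
         END { for (w in count) print count[w], w }' |
    sort -k1,1nr
}

result=$(two_letter_counts "$wordlist")
printf '%s\n' "$result"
rm -f "$wordlist"
```

On a real site you would run the function against your own db.wordlist,
keep "4h" and anything else you care about, and add the rest of the
two-letter words to your bad words file.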

You may also find that indexing all 2-letter words isn't that big a
problem after all.  It will certainly make your wordlist and word db
bigger, but on a reasonably fast system with adequate disk, that may
not hurt at all.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a 
subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html
