According to David Adams:
> I am using htdig 3.1.2, and my config file includes:
> 
> extra_word_characters:  _
> valid_punctuation:      !@#$%^&*()-+|~=`{}[]:";'<>?,./
> 
> I find that the word database built by htdig includes many words that
> contain or end in a comma or other punctuation. For example:
> 
> arts,   i:2514  l:1     w:49950
> assessed,       i:2523  l:1     w:49950
> atmospheric,    i:2529  l:1     w:49950
> b.sc,   i:120   l:1     w:49950
> b.sc,   i:16406 l:1     w:49950
> b.sc,   i:16409 l:1     w:49950
> b.sc,   i:3039  l:1     w:49950
> b.sc,   i:3040  l:1     w:49950
> b.sc,   i:3041  l:1     w:49950
> ba,     i:17    l:1     w:49950

I believe part of the problem may be the backquote (`) character
in the list above, which is taken as the start of a file expansion
(e.g. `filename`).  As there's no file called "{}[]:";'<>?,./", the
backquote and everything after it are lost from the valid_punctuation
list.  You'd need to escape the backquote with a backslash (\).  The
same goes for the dollar sign ($), except that in this case only that
one character is lost.
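
As a rough illustration of the escaping described above, here is a small
Python sketch that builds the config line.  The helper name
escape_attribute_value is hypothetical and not part of htdig, and it
assumes that only the backquote and the dollar sign need a backslash.

```python
def escape_attribute_value(chars: str) -> str:
    """Prefix each backquote and dollar sign with a backslash.

    Assumption: only ` and $ are special in the config file, as
    described in the message above.
    """
    special = {"`", "$"}
    return "".join("\\" + c if c in special else c for c in chars)


# The punctuation list from the original config file, unescaped.
punctuation = "!@#$%^&*()-+|~=`{}[]:\";'<>?,./"

# Prints a line that could be pasted into the config file.
print("valid_punctuation:      " + escape_attribute_value(punctuation))
```

It would be worth rebuilding the database afterwards and checking
whether words ending in a comma still show up.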

Still, that wouldn't explain why the comma and period get entered into
the database.  This would suggest that those characters were in the
extra_word_characters list, or were erroneously treated as alphanumeric
by your locale's LC_CTYPE tables.

> Am I misunderstanding the documentation on "valid_punctuation"?
> 
> I can't figure out how the configuration file attributes 
> 
>       extra_word_characters 
> and
>       valid_punctuation 
> 
> work together.  What happens when the same character is in both?

The lists should not overlap, but if they do, I believe valid_punctuation
overrides, so the overlapping characters do get stripped out of the word.

Essentially, both lists indicate which punctuation marks or other
characters can be used within a word, but the valid_punctuation characters
get stripped out before the word is put in the database.  E.g. words like
post-doctoral and nuts&bolts go into the database as postdoctoral and
nutsbolts, unless you move the hyphen or ampersand from valid_punctuation
to extra_word_characters, in which case the characters stay in the word.
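
To make the difference concrete, here is a minimal Python sketch of the
stripping step.  It is only a model of the behaviour described above,
not htdig's actual code, and the function name normalize_word is made
up for the example.

```python
def normalize_word(word: str, valid_punctuation: str,
                   extra_word_characters: str) -> str:
    """Return the form of `word` that would be stored, under this model.

    Characters in valid_punctuation are stripped.  Characters in
    extra_word_characters are kept, unless they also appear in
    valid_punctuation (assumed to take precedence on overlap).
    """
    return "".join(ch for ch in word if ch not in valid_punctuation)


# Hyphen and ampersand listed as valid punctuation: they are stripped.
print(normalize_word("post-doctoral", "-&", "_"))   # postdoctoral
print(normalize_word("nuts&bolts", "-&", "_"))      # nutsbolts

# Hyphen moved to extra_word_characters: it stays in the word.
print(normalize_word("post-doctoral", "&", "_-"))   # post-doctoral
```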

Additionally, with the compound word patch I posted last week, which
will be in future releases, a word is also split wherever it contains
a non-alphanumeric character that's in valid_punctuation but not in
extra_word_characters.  Thus, a word like post-doctoral will go into
the database as postdoctoral, post and doctoral.
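
A sketch of that compound-word behaviour in Python, again only as a
model of the description above and not the patch itself.  The name
index_forms and the order of the returned list are my own choices.

```python
def index_forms(word: str, valid_punctuation: str,
                extra_word_characters: str) -> list[str]:
    """List the forms of `word` that would be indexed, under this model."""
    # Split points: non-alphanumeric characters that are in
    # valid_punctuation but not in extra_word_characters.
    split_chars = {c for c in valid_punctuation
                   if not c.isalnum() and c not in extra_word_characters}

    # The stripped form, with all valid_punctuation removed.
    forms = ["".join(c for c in word if c not in valid_punctuation)]

    parts, current = [], []
    for ch in word:
        if ch in split_chars:
            if current:
                parts.append("".join(current))
                current = []
        elif ch not in valid_punctuation:
            current.append(ch)
    if current:
        parts.append("".join(current))

    # Only add the pieces when the word was actually split.
    if len(parts) > 1:
        forms.extend(p for p in parts if p not in forms)
    return forms


print(index_forms("post-doctoral", "-&", "_"))
# ['postdoctoral', 'post', 'doctoral']
```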

> Why doesn't the documented list of default characters for
> valid_punctuation include the question mark (?) and the doublequote (")?

This is because these characters aren't commonly used within words,
unlike apostrophes, ampersands, hyphens and slashes.  Also, when you set
allow_numbers to index numbers as words, these numbers may contain some of
these characters:  .-/#$% , and that's why they're in the default list.
I don't know why _!^ are in the default list, but I suspect they may be
used for indexing source code.  If a given punctuation mark should ALWAYS
separate words, it should not be added to this list.

> What separates words, is it whitespace only?

White space or any punctuation character (actually, any non-alphanumeric
character) not listed in extra_word_characters or valid_punctuation.
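
In other words, the splitting rule can be sketched like this in
Python.  The function name split_words is hypothetical, and
str.isalnum() stands in for the locale's notion of alphanumeric, which
may not match what htdig sees on your system.

```python
def split_words(text: str, valid_punctuation: str,
                extra_word_characters: str) -> list[str]:
    """Split text into raw words, before any punctuation is stripped."""
    word_chars = set(valid_punctuation) | set(extra_word_characters)
    words, current = [], []
    for ch in text:
        if ch.isalnum() or ch in word_chars:
            current.append(ch)          # still inside a word
        elif current:
            words.append("".join(current))  # separator ends the word
            current = []
    if current:
        words.append("".join(current))
    return words


# Comma and parentheses are not in either list here, so they separate
# words; period and hyphen are, so they stay inside the raw word.
print(split_words("arts, science (b.sc) post-doctoral", ".-", "_"))
# ['arts', 'science', 'b.sc', 'post-doctoral']
```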

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
