Re: [htdig3-dev] Extra word-characters attribute: extra_word_characters

Gilles Detillieux Fri, 12 Mar 1999 12:39:25 -0500

According to Hans-Peter Nilsson:
> I plan to add a new attribute: extra_word_characters.
> It is roughly the opposite of valid_punctuation: it marks a (possibly)
> non-alphanumeric character as a valid word character.
> 
> This way (and no other I know of), I can make "_" characters part of
> words, and searchable as such.
> 
> A (hopefully) positive side-effect is that people having problems making
> their systems understand their locale (i.e. it is broken in that it
> handles everything as the "C" locale) can state characters here that the
> locale would normally handle.
> 
> Examples:
>  extra_word_characters: _
>  extra_word_characters: "������"
> 
> (If you didn't get the last one, don't worry.)
> Specifying characters handled by the locale as isalpha would be a no-op.
> 
> Comments welcome.
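
A rough sketch of how the proposed attribute might behave, for illustration
only. The names is_word_char() and split_words() are hypothetical and are
not taken from the htdig source; the sketch only assumes that a character
is a word character if isalnum() accepts it or if it appears in
extra_word_characters.

```cpp
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: a character is part of a word if the current locale
// classifies it as alphanumeric, or if it is listed in extra_word_characters.
bool is_word_char(unsigned char c, const std::string &extra_word_characters)
{
    return std::isalnum(c) != 0
        || extra_word_characters.find(static_cast<char>(c)) != std::string::npos;
}

// Hypothetical helper: split text into words using the test above.
std::vector<std::string> split_words(const std::string &text,
                                     const std::string &extra_word_characters)
{
    std::vector<std::string> words;
    std::string current;
    for (unsigned char c : text) {
        if (is_word_char(c, extra_word_characters)) {
            current += static_cast<char>(c);
        } else if (!current.empty()) {
            // A non-word character ends the word collected so far.
            words.push_back(current);
            current.clear();
        }
    }
    if (!current.empty())
        words.push_back(current);
    return words;
}
```

With an empty attribute, "foo_bar" splits into "foo" and "bar" in the "C"
locale; with extra_word_characters set to "_", it stays one word.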

Hi again!  It just occurred to me that the reason some people may have
a problem with accented letters on some systems, even with locale set,
may be the very same reason that Dan Dexter's htdig was hanging on one
of his documents.  I haven't received confirmation from Dan yet that my
hunch was correct, but as I told him, I don't see what else could cause
the hang.

My hunch was that on his system, the isalnum() function, called from
Configuration::Add(), does not recognise the ISO-8859-1 i-acute character
in an unquoted meta tag as alphanumeric, but isalpha() does recognise it
as alphabetic.  This would cause Configuration::Add(), which is called
to parse the meta tag parameters, to go into an infinite loop.
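
To make the hunch concrete, here is a minimal sketch of how two
inconsistent classifiers could stall a scanning loop. This is not the code
from Configuration::Add(); the function names and the loop structure are
assumptions made for illustration. The classifiers are hand-written
stand-ins for a library whose "alpha" test accepts ISO-8859-1 letters while
its "alnum" test accepts ASCII only. A progress guard replaces the hang so
the sketch terminates.

```cpp
#include <cstddef>
#include <string>

// Hypothetical classifiers imitating the suspected inconsistency.
static bool ascii_alnum(unsigned char c)
{
    return (c >= '0' && c <= '9') || (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
}
static bool latin1_alpha(unsigned char c)
{
    // ASCII letters plus ISO-8859-1 letters (0xD7 and 0xF7 are math signs).
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= 0xC0 && c != 0xD7 && c != 0xF7);
}
static bool latin1_alnum(unsigned char c)
{
    // Consistent by construction: alpha or digit.
    return latin1_alpha(c) || (c >= '0' && c <= '9');
}

using Classifier = bool (*)(unsigned char);

// Returns true if the scan reaches the end of the input, false if an
// iteration makes no progress (an endless loop without this guard).
bool scan_terminates(const std::string &input, Classifier starts_name, Classifier in_name)
{
    std::size_t pos = 0;
    while (pos < input.size()) {
        std::size_t before = pos;
        unsigned char c = input[pos];
        if (starts_name(c)) {
            // Consume a name; this relies on in_name() accepting
            // every character that starts_name() accepted.
            while (pos < input.size() && in_name(static_cast<unsigned char>(input[pos])))
                ++pos;
        } else {
            ++pos;  // skip separators and anything else
        }
        if (pos == before)
            return false;  // no progress: an unguarded loop would spin here
    }
    return true;
}
```

Calling scan_terminates("name=S\xEDmon", latin1_alpha, ascii_alnum) stalls
at the i-acute (0xED), while the same input with latin1_alnum as the second
classifier is scanned to the end.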

If this is indeed what's happening, the isalnum(*str) in that function
should be changed to isalpha(*str) || isdigit(*str) to ensure consistency.
What's more, there are TONS of calls (OK, 18 of them) to isalnum()
elsewhere in the code.  For consistency, these should all be changed,
it would seem.
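
The replacement could be wrapped in one small helper so that all the call
sites change the same way. The name consistent_isalnum() is hypothetical;
only the expression isalpha(*str) || isdigit(*str) comes from the
suggestion above. Standard C defines isalnum() as true exactly when
isalpha() or isdigit() is true, so on a conforming library the helper
should make no difference. The cast reflects the rule that the <ctype.h>
functions take a value representable as unsigned char (or EOF), so a plain
char holding an accented letter should not be passed directly where char
is signed.

```cpp
#include <cctype>

// Hypothetical helper: alphanumeric test built from isalpha() and isdigit()
// so that it cannot disagree with isalpha() on the same library.
inline bool consistent_isalnum(char ch)
{
    // Convert to unsigned char first; a negative char value passed to the
    // <cctype> functions is undefined behaviour.
    unsigned char c = static_cast<unsigned char>(ch);
    return std::isalpha(c) != 0 || std::isdigit(c) != 0;
}
```

A call such as isalnum(*str) would then become consistent_isalnum(*str).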

Speculating even further on this hunch, I'd say that the term
alpha-numeric is a bit ambiguous.  If you're parsing a programming
language, for instance, you'd probably want isalnum() to work in a strict
ASCII sense of the word, allowing A-Z, a-z and 0-9, and rejecting accented
characters.  If you're parsing a document, it's a different matter.
Perhaps the programmers who set up the locale information on different
systems started out with different assumptions about what this function
should do, hence the inconsistent behaviour.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930