Hi Jim,

> diff --git a/src/dfa.c b/src/dfa.c
> index e28726d..8f79508 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -1071,8 +1071,18 @@ parse_bracket_exp (void)
>    return CSET + charclass_index(ccl);
>  }
>
> +/* Add this to the test for whether a byte is word-constituent, since on
> +   BSD-based systems, many values in the 128..255 range are classified as
> +   alphabetic, while on glibc-based systems, they are not.  */
> +#ifdef __GLIBC__
> +# define octet_valid_as_wide_char(c) 1
> +#else
> +# define octet_valid_as_wide_char(c) (MBS_SUPPORT && btowc (c) != WEOF)
> +#endif
> +
>  /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
> -#define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
> +#define IS_WORD_CONSTITUENT(C) \
> +  (octet_valid_as_wide_char(C) && (isalnum(C) || (C) == '_'))
>

This code would do the job. Only, I find this macro name
'octet_valid_as_wide_char' confusing - because values such as 0xC3 are
valid octets and also valid wide characters.

I would call this macro 'is_valid_single_byte_character' or
'is_valid_unibyte_character'. Then it's clear why it has to map 0xC3 to
false in UTF-8 encoding.

Bruno
-- 
In memoriam Ricardo Flores Magón <http://en.wikipedia.org/wiki/Ricardo_Flores_Magón>
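
P.S. To illustrate the naming point, here is a minimal sketch, not the actual patch. It assumes the helper is renamed to is_valid_unibyte_character (one of the two names suggested above) and, for brevity, leaves out the __GLIBC__ and MBS_SUPPORT conditions of the quoted diff. The report function is purely hypothetical and only there for experimenting.

```c
#include <ctype.h>
#include <stdio.h>
#include <wchar.h>

/* Hypothetical rename of octet_valid_as_wide_char: non-zero if the byte C,
   taken alone, is a complete character in the current locale's encoding.
   Simplification: the __GLIBC__ and MBS_SUPPORT guards are omitted here.  */
#define is_valid_unibyte_character(c) (btowc (c) != WEOF)

/* Same shape as the patched macro, with only the helper renamed.  */
#define IS_WORD_CONSTITUENT(C) \
  (is_valid_unibyte_character (C) && (isalnum (C) || (C) == '_'))

/* Hypothetical helper: print how byte C (0..255) is classified in the
   locale that is active when it is called.  */
void
report (int c)
{
  printf ("0x%02X: unibyte=%d word=%d\n", c,
          is_valid_unibyte_character (c) != 0,
          IS_WORD_CONSTITUENT (c) != 0);
}
```

Calling report (0xC3) after setlocale (LC_ALL, "") in a UTF-8 locale should show unibyte=0, since 0xC3 is only the first byte of a two-byte sequence there; in a single-byte locale such as ISO-8859-1 the same byte is a complete character. With the new name, that difference is what one would expect to read from the macro.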
