Bruno Haible wrote:
> Hi Jim,
>
>> diff --git a/src/dfa.c b/src/dfa.c
>> index e28726d..8f79508 100644
>> --- a/src/dfa.c
>> +++ b/src/dfa.c
>> @@ -1071,8 +1071,18 @@ parse_bracket_exp (void)
>>    return CSET + charclass_index(ccl);
>>  }
>>
>> +/* Add this to the test for whether a byte is word-constituent, since on
>> +   BSD-based systems, many values in the 128..255 range are classified as
>> +   alphabetic, while on glibc-based systems, they are not.  */
>> +#ifdef __GLIBC__
>> +# define octet_valid_as_wide_char(c) 1
>> +#else
>> +# define octet_valid_as_wide_char(c) (MBS_SUPPORT && btowc (c) != WEOF)
>> +#endif
>> +
>>  /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
>> -#define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
>> +#define IS_WORD_CONSTITUENT(C) \
>> +  (octet_valid_as_wide_char(C) && (isalnum(C) || (C) == '_'))
>>
>
> This code would do the job.
>
> Only, I find this macro name 'octet_valid_as_wide_char' confusing -
> because values such as 0xC3 are valid octets and also valid wide characters.
> I would call this macro 'is_valid_single_byte_character' or
> 'is_valid_unibyte_character'. Then it's clear why it has to map 0xC3 to false
> in UTF-8 encoding.

Thanks. I prefer your names, too. I'll use is_valid_unibyte_character.
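
For reference, here is a rough standalone sketch of the check under the new name. It is not the patch itself: it uses static functions rather than macros, leaves out the __GLIBC__ and MBS_SUPPORT conditionals, and the helper name is_word_constituent is made up for this illustration.

```c
#include <ctype.h>
#include <wchar.h>

/* Illustrative function form of the renamed macro (an assumption, not the
   committed code): nonzero if the byte C, taken on its own, is a complete
   character in the current locale's encoding.  */
static int
is_valid_unibyte_character (int c)
{
  return btowc (c) != WEOF;
}

/* Hypothetical counterpart of IS_WORD_CONSTITUENT: C must be a valid
   single-byte character and be alphanumeric or an underscore.  C is
   expected to be in the range 0..255.  */
static int
is_word_constituent (int c)
{
  return is_valid_unibyte_character (c) && (isalnum (c) || c == '_');
}
```

In a UTF-8 locale, 0xC3 is only the first byte of a two-byte sequence, so btowc (0xC3) returns WEOF and the byte is rejected no matter what isalnum says about it. ASCII bytes such as 'a' or '_' pass the btowc test in any locale.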
