Re: grep-2.10 testing

Bruno Haible Mon, 21 Nov 2011 05:55:50 -0800

Hi Jim,

> diff --git a/src/dfa.c b/src/dfa.c
> index e28726d..8f79508 100644
> --- a/src/dfa.c
> +++ b/src/dfa.c
> @@ -1071,8 +1071,18 @@ parse_bracket_exp (void)
>    return CSET + charclass_index(ccl);
>  }
> 
> +/* Add this to the test for whether a byte is word-constituent, since on
> +   BSD-based systems, many values in the 128..255 range are classified as
> +   alphabetic, while on glibc-based systems, they are not.  */
> +#ifdef __GLIBC__
> +# define octet_valid_as_wide_char(c) 1
> +#else
> +# define octet_valid_as_wide_char(c) (MBS_SUPPORT && btowc (c) != WEOF)
> +#endif
> +
>  /* Return non-zero if C is a `word-constituent' byte; zero otherwise.  */
> -#define IS_WORD_CONSTITUENT(C) (isalnum(C) || (C) == '_')
> +#define IS_WORD_CONSTITUENT(C) \
> +  (octet_valid_as_wide_char(C) && (isalnum(C) || (C) == '_'))
>


This code would do the job.

My only objection is the macro name 'octet_valid_as_wide_char', which I find
confusing: values such as 0xC3 are valid octets and also valid wide characters.
I would call this macro 'is_valid_single_byte_character' or
'is_valid_unibyte_character'. Then it's clear why it has to map 0xC3 to false
in UTF-8 encoding, where 0xC3 is only the first byte of a two-byte sequence.
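
To illustrate the naming point, here is a minimal sketch of the check written
as functions instead of macros. It is not the patch itself: the function names
are hypothetical (taken from the suggestion above), and the __GLIBC__ and
MBS_SUPPORT conditions from the patch are left out for brevity. It relies only
on the standard btowc(), which returns WEOF when a byte on its own is not a
complete character in the current locale.

```c
#include <ctype.h>
#include <wchar.h>

/* Hypothetical function form of the proposed name: non-zero if the byte C,
   taken alone, is a complete character in the current locale.  C is
   expected to be in the range 0..255.  */
static int
is_valid_unibyte_character (int c)
{
  return btowc (c) != WEOF;
}

/* Word-constituent test in the spirit of the patch, using the renamed
   check.  A byte that is not a character by itself is never a word
   constituent, whatever isalnum() says about it.  */
static int
is_word_constituent (int c)
{
  return is_valid_unibyte_character (c) && (isalnum (c) || c == '_');
}
```

In a UTF-8 locale, is_valid_unibyte_character (0xC3) should be false, so
is_word_constituent (0xC3) is false even on systems where isalnum (0xC3) is
true. ASCII bytes such as 'a' and '_' are unaffected.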

Bruno
-- 
In memoriam Ricardo Flores Magón 
<http://en.wikipedia.org/wiki/Ricardo_Flores_Magón>
