On Fri, Apr 23, 2010 at 22:51, Paul Eggert <[email protected]> wrote:
> Paolo Bonzini <[email protected]> writes:
>
>> On 04/18/2010 06:32 AM, Ivan wrote:
>>> So... right now, "." means "valid UTF-8 character"? Or not?
>>
>> Yes, if your locale is UTF-8.
>
> Wouldn't it be better to model encoding errors as characters? That is,
> if grep sees a byte that cannot possibly be the start of a character, we
> call it a "character" even though it is not in the standard Unicode
> character set. Internally, we could model it as (say) a negative
> number, the negative of the byte value (so it would be in the range -255
> .. -128).

This would have to be changed in glibc first, and then in dfa.c.
Encoding errors in the regex are supported, but "." doesn't capture an
invalid character.

Paolo
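
For illustration only, the representation Paul proposes could be sketched roughly as follows. This is not grep, glibc, or dfa.c code; the function name decode_with_errors is hypothetical, and the sketch assumes Python's strict UTF-8 decoder as the definition of "valid", so it also flags truncated and overlong sequences byte by byte, which goes slightly beyond "a byte that cannot possibly be the start of a character".

```python
def decode_with_errors(data: bytes) -> list[int]:
    """Decode UTF-8 into a list of integers.

    Valid characters become their (non-negative) code points.
    Each byte that cannot be decoded becomes the negative of its
    byte value, so it falls in the range -255 .. -128.
    (Hypothetical helper, written only to illustrate the proposal.)
    """
    units = []
    i = 0
    while i < len(data):
        # A UTF-8 character is 1 to 4 bytes long; try the shortest first.
        for length in (1, 2, 3, 4):
            chunk = data[i:i + length]
            try:
                ch = chunk.decode("utf-8")
            except UnicodeDecodeError:
                continue
            units.append(ord(ch))
            i += len(chunk)
            break
        else:
            # No prefix starting here is valid: model this single byte
            # as an "error character" and resynchronize at the next byte.
            units.append(-data[i])
            i += 1
    return units
```

Under this model a "." could be defined to match any single element of the list, negative or not, whereas the behaviour Paolo describes corresponds to "." matching only the non-negative elements.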
