Paolo Bonzini <[email protected]> writes:

> On 04/18/2010 06:32 AM, Ivan wrote:
>> So... right now, "." means "valid UTF-8 character"? Or not?
>
> Yes, if your locale is UTF-8.

Wouldn't it be better to model encoding errors as characters? That is, if grep sees a byte that cannot possibly be the start of a character, we call it a "character" even though it is not in the standard Unicode character set. Internally, we could model it as (say) a negative number, the negative of the byte value (so it would be in the range -255 .. -128).

Under this approach, the regular expression "." will match all nonempty lines, which is what most users expect. The current approach, where "." matches only lines that contain at least one valid UTF-8 character, is not nearly as useful or intuitive.

This modeling could be done consistently in both regular expressions and in input. It's very easy to explain: surely it's much easier than whatever the current rules are.
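
To make the proposal concrete, here is a rough sketch in Python (not grep's actual code, and not a patch). The names decode_with_errors and dot_matches are made up for illustration. One assumption beyond the text above: a lead byte whose continuation bytes are missing or invalid is also treated as a one-byte encoding error, so every error byte is >= 0x80 and maps into -255 .. -128.

```python
from typing import List


def decode_with_errors(data: bytes) -> List[int]:
    """Decode UTF-8 into a list of ints (hypothetical helper).

    Valid characters become their code points (>= 0).
    A byte b that cannot start a valid sequence becomes -b.
    """
    out: List[int] = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # ASCII byte: always a valid one-byte character.
            out.append(b)
            i += 1
            continue
        # Expected sequence length, judged from the lead byte alone.
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0  # cannot possibly start a character
        chunk = data[i:i + length]
        try:
            if length == 0 or len(chunk) < length:
                raise ValueError("not a complete sequence")
            # Strict decoding rejects overlong forms and surrogates.
            out.append(ord(chunk.decode("utf-8")))
            i += length
        except ValueError:  # UnicodeDecodeError is a subclass
            # Assumption: model the offending byte as a negative "character".
            out.append(-b)
            i += 1
    return out


def dot_matches(line: bytes) -> bool:
    """Under this model, "." matches any line with at least one byte."""
    return len(decode_with_errors(line)) > 0
```

With this model, a line consisting only of the byte 0xFF decodes to the single "character" -255, so "." matches it. Only the empty line fails to match.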
