Re: removing blank lines: "grep ." is really slow

Paul Eggert Fri, 23 Apr 2010 13:52:16 -0700

Paolo Bonzini <[email protected]> writes:

> On 04/18/2010 06:32 AM, Ivan wrote:
>> So... right now, "." means "valid UTF-8 character"? Or not?
>
> Yes, if your locale is UTF-8.


Wouldn't it be better to model encoding errors as characters?  That is,
if grep sees a byte that cannot possibly be the start of a character, we
call it a "character" even though it is not in the standard Unicode
character set.  Internally, we could model it as (say) a negative
number, the negative of the byte value (so it would be in the range -255
.. -128).
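
A minimal sketch of this modeling, in Python rather than grep's own C, and
not taken from grep's source. The function name decode_with_error_chars is
hypothetical. It also assumes that the lead byte of a truncated or otherwise
invalid multibyte sequence counts as an encoding error, which goes slightly
beyond "cannot possibly be the start of a character".

```python
from typing import List


def decode_with_error_chars(data: bytes) -> List[int]:
    """Decode UTF-8 into integers, one per "character".

    Valid characters become their Unicode code point (>= 0).
    Each byte b that does not start a valid sequence becomes -b,
    which always falls in the range -255 .. -128.
    """
    out: List[int] = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII: one byte, one character.
            out.append(b)
            i += 1
            continue

        # Sequence length implied by the lead byte (0 = not a lead byte).
        if 0xC2 <= b <= 0xDF:
            length = 2
        elif 0xE0 <= b <= 0xEF:
            length = 3
        elif 0xF0 <= b <= 0xF4:
            length = 4
        else:
            length = 0

        if length:
            try:
                # Strict decoding rejects truncated, overlong, and
                # surrogate sequences.
                ch = data[i:i + length].decode("utf-8")
            except UnicodeDecodeError:
                ch = ""
            if ch:
                out.append(ord(ch))
                i += length
                continue

        # Assumption: model the offending byte as a negative "character".
        out.append(-b)
        i += 1
    return out
```

For example, decode_with_error_chars(b"a\xffb") returns [97, -255, 98]: the
stray 0xFF byte becomes a single pseudo-character between two ordinary ones.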

Under this approach, the regular expression "." will match all nonempty
lines, which is what most users expect.  The current approach, where "."
matches only lines that contain at least one valid UTF-8 character, is
not nearly as useful or intuitive.
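
The difference between the two behaviors can be sketched as a pair of
predicates on a line's bytes (without the trailing newline). This models
only the behavior described in this message, not grep's actual matcher, and
both function names are hypothetical.

```python
def dot_matches_current(line: bytes) -> bool:
    """Current behavior as described: "." needs at least one valid character.

    errors="ignore" drops undecodable bytes, so the result is nonempty
    only when the line holds at least one valid UTF-8 character.
    """
    return len(line.decode("utf-8", errors="ignore")) > 0


def dot_matches_proposed(line: bytes) -> bool:
    """Proposed behavior: every byte belongs to a valid character or is
    itself an error "character", so "." matches any nonempty line."""
    return len(line) > 0


if __name__ == "__main__":
    for sample in (b"", b"abc", b"\xff\xfe", b"\xffa"):
        print(sample, dot_matches_current(sample), dot_matches_proposed(sample))
```

The two predicates disagree only on nonempty lines made up entirely of
invalid bytes, such as b"\xff\xfe".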

This modeling could be done consistently in both regular expressions and
in input.  It's very easy to explain: surely it's much easier than
whatever the current rules are.
