Re: bug in busybox sed with non-ascii chars

Denys Vlasenko Sun, 04 May 2014 07:45:06 -0700

On Sat, May 3, 2014 at 5:07 PM, Rich Felker <[email protected]> wrote:
>> Lets refuse to find end of line if there is a non UTF-8 sequence inside that 
>> line?
>> Sounds wrong to me...
>
> sed (also regcomp and regexec) requires text input. Byte streams with
> illegal sequences are not text. Actually since the regex is not trying
> to match the illegal sequence, just the end-of-line, it would
> theoretically be possible to make this work (and it will once we
> overhaul the regex implementation to work with byte-based DFA's rather
> than character-based ones), but that doesn't change the fact that it's
> an invalid test.


Language lawyering is less important than real-world usage.

Adding a char to each line of text is quite a reasonable thing to do.

Having occasional UTF-8 violations in text files is not rare either.
Linux kernel source code has 57 instances of it in *.c and *.h files.
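
To make the use case concrete, here is a minimal Python sketch of the
operation being discussed: appending a char to every line while treating
the input as raw bytes, so an invalid UTF-8 sequence inside a line cannot
hide the end of that line. This is only an illustration, not busybox or
musl code; the helper names append_to_each_line and is_valid_utf8 and the
sample bytes are assumptions made up for this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


def append_to_each_line(data: bytes, suffix: bytes) -> bytes:
    """Append suffix to every line, splitting only on the newline byte.

    Working on bytes means the line boundary is found regardless of
    whether the bytes inside the line form valid UTF-8.
    """
    lines = data.split(b"\n")
    # If the input ends with a newline, the last element is empty.
    last = lines.pop()
    out = [line + suffix for line in lines]
    if last:
        # Final line has no trailing newline; keep it that way.
        out.append(last + suffix)
        return b"\n".join(out)
    return b"\n".join(out) + (b"\n" if out else b"")


# Hypothetical sample: a Latin-1 byte (0xE9) inside otherwise ASCII text,
# which is not a valid UTF-8 sequence.
sample = b"caf\xe9 au lait\nplain ascii\n"
result = append_to_each_line(sample, b"X")
```

With this sample, is_valid_utf8(sample) is False, yet both lines still
get the appended char, which is the behavior a user would expect from
something like sed 's/$/X/'.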
_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox
