On Sat, May 3, 2014 at 5:07 PM, Rich Felker <[email protected]> wrote:
>> Lets refuse to find end of line if there is a non UTF-8 sequence inside that
>> line?
>> Sounds wrong to me...
>
> sed (also regcomp and regexec) requires text input. Byte streams with
> illegal sequences are not text. Actually since the regex is not trying
> to match the illegal sequence, just the end-of-line, it would
> theoretically be possible to make this work (and it will once we
> overhaul the regex implementation to work with byte-based DFA's rather
> than character-based ones), but that doesn't change the fact that it's
> an invalid test.

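To illustrate the "just the end-of-line" point from the quoted text: in UTF-8 the byte 0x0A never occurs inside a multi-byte sequence, so line ends can be found on raw bytes whether or not the rest of the line decodes. Below is a minimal Python sketch of that idea. It is not the busybox or musl code, and the function names append_to_lines and is_valid_utf8 are made up for this example.

```python
def append_to_lines(data: bytes, suffix: bytes) -> bytes:
    """Append `suffix` to every line, treating the input as raw bytes.

    Hypothetical helper: it only looks for b"\\n" and never decodes,
    so invalid UTF-8 inside a line is passed through untouched.
    """
    if not data:
        return data
    # Remember whether the last line is newline-terminated.
    has_final_newline = data.endswith(b"\n")
    body = data[:-1] if has_final_newline else data
    lines = body.split(b"\n")
    result = b"\n".join(line + suffix for line in lines)
    return result + b"\n" if has_final_newline else result


def is_valid_utf8(data: bytes) -> bool:
    """Return True if `data` decodes as strict UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


if __name__ == "__main__":
    # 0xFF can never appear in valid UTF-8.
    sample = b"ok\nbad \xff byte\n"
    print(is_valid_utf8(sample))
    print(append_to_lines(sample, b"!"))
```

The sketch splits only on b"\n", as sed does, rather than using bytes.splitlines(), which would also treat a lone b"\r" as a line boundary.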
Language lawyering is less important than real-world usage. Adding a char to each line of text is quite a reasonable thing to do. Occasional UTF-8 violations in text files are not rare either: the Linux kernel source code has 57 instances of them in *.c and *.h files.

_______________________________________________
busybox mailing list
[email protected]
http://lists.busybox.net/mailman/listinfo/busybox
