Rich Felker wrote:
> > > Other similar problem: I open a file in a text editor and it contains
> > > illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
> >
> > Again, you seem to be dealing with special cases.
>
> Again, software which does not handle corner cases correctly is crap.
Why are you confusing "special-case" with "corner case"?  I never said
that software shouldn't handle corner cases such as illegal UTF-8
sequences.  I meant that an editor that handles illegal UTF-8 sequences
other than by simply rejecting the edit request is a bit of a special case
compared to general-purpose software, say an XML processor, for which some
specification requires (or recommends?) that the processor ignore or
reject any illegal sequences.  The software isn't failing to handle the
corner case; it is handling it--by explicitly rejecting it.

> > If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
> > would you expect a UTF-8 text editor to work on it?
>
> I expect my text editor to be able to edit any file without corrupting
> it.

Okay, then it's not a UTF-8-only text editor.

> Perhaps you have lower expectations... If you're used to Windows
> Notepad, that would be natural, but I'm used to GNU Emacs.

I'm used to Emacs too.  Quit casting implied aspersions.

> > (If you see the byte sequence 0xDF 0xBF, how do you know whether that
> > means the character U+003FF
>
> It never means U+03FF in any case because U+03FF is 0xCF 0xBF...
>
> > or the two characters U+00DF U+00BF?  For
>
> It never means this in text on my system because the text encoding is
> UTF-8.  It would mean this only if your local character encoding were
> Latin-1.

What I meant (given the quoted part below, which you replied to before)
was this:  if you're dealing with a file that overall isn't valid UTF-8,
how would you know whether a particular part that looks like valid UTF-8,
representing some characters per the UTF-8 interpretation, really
represents those characters or is an erroneously mixed-in representation
of other characters in some other encoding?  Since you're talking about
preserving what's there as opposed to doing anything more than that, I
would guess your answer is that it really doesn't matter.  (Whether you
treated 0xCF 0xBF as a correct UTF-8 sequence and displayed the character
U+03FF or, hypothetically, treated it as an incorrectly inserted Latin-1
encoding of U+00CF U+00BF and displayed those characters, you'd still
write the same bytes back out.)

> > example, if at one point you see the UTF-8-illegal byte sequence
> > 0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
>
> This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.

You said you're talking about a text editor that reads bytes, displays
legal UTF-8 sequences as the characters they represent in UTF-8, doesn't
reject other UTF-8-illegal bytes, and does something with those bytes.
What does it do with such a byte?  It seems you were talking about mapping
it to some character to display it.  Are you talking about something else,
such as displaying the hex value of the byte?

> > > > Of course, you can still do that with character-based strings if you
> > > > can use other encodings.  (E.g., in Java, you can read the mail
> > > > as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
> > > > Then you can write the regular expression in terms of Unicode characters
> > > > 0-255.  The only disadvantage there is probably some time spent
> > > > decoding the byte stream into the internal representation of
> > > > characters.)
> > >
> > > The biggest disadvantage of it is that it's WRONG.
> >
> > Is it any more wrong than your rejecting of 1xxxxxx bytes?  The bytes
> > represent characters in some encoding.  You ignore those characters and
> > reject based on just the byte values.
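For concreteness, here is a minimal sketch of the technique described in
the quoted Java paragraph above:  decode the bytes as ISO-8859-1, so that
each byte 0x00-0xFF becomes the character with the same numeric value, and
then an ordinary character regular expression can stand in for a byte-level
one (here, "is any byte's high bit set?").  The class name and sample bytes
are only illustrative, and it assumes a Java version that provides
java.nio.charset.StandardCharsets:

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    public class ByteRegexSketch {
        public static void main(String[] args) {
            // Raw bytes that are not valid UTF-8 (0xBF is a stray continuation byte).
            byte[] raw = { 'a', 'b', (byte) 0xBF, 'c' };

            // ISO-8859-1 maps each byte 0x00-0xFF to the char with the same numeric
            // value, so the decoder never rejects or alters any input byte.
            String asChars = new String(raw, StandardCharsets.ISO_8859_1);

            // A "byte-valued" regular expression: is any byte's high bit set?
            Pattern highBit = Pattern.compile("[\\x80-\\xFF]");
            System.out.println(highBit.matcher(asChars).find());        // true

            // Re-encoding as ISO-8859-1 reproduces the original bytes exactly.
            byte[] back = asChars.getBytes(StandardCharsets.ISO_8859_1);
            System.out.println(java.util.Arrays.equals(raw, back));     // true
        }
    }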
> Nobody said that it needs to be "rejected" ...

Yes, someone did--they wrote about rejecting spam mail by detecting
bytes/octets with the high bit set.

...

> > > The data is not Latin-1, and pretending it's Latin-1 is a hideous
> > > hack.
>
> > It's not pretending the data is bytes encoding characters.  It's mapping
> > bytes to characters to use methods defined on characters.  Yes, it could
> > be misleading if it's not clear that it's a temporary mapping only for
> > that purpose (i.e., that the mapped-to characters are not the characters
> > that the byte sequence really represents).  And yes, byte-based regular
> > expressions would be useful.
>
> If you're going to do this, at least map into the PUA rather than to
> Latin-1..... At least that way it's clear what the meaning is.

That makes it a bit less convenient, since then the numeric values of the
characters don't match the numeric values of the bytes (see the sketch at
the end of this message).  But yes, doing all that is not something you'd
want to escape into the wild (be seen outside the immediate code where you
need to fake byte-level regular expressions in Java).

Daniel

-- 
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
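The sketch referred to above:  a minimal illustration of mapping bytes into
the Private Use Area instead of to Latin-1, assuming an arbitrary base of
U+E000 (so byte 0xNN becomes U+E0NN and cannot be mistaken for real text);
the class and method names are hypothetical:

    import java.util.regex.Pattern;

    public class PuaByteMappingSketch {
        // Assumed offset: byte 0xNN <-> private-use character U+E0NN.
        private static final int PUA_BASE = 0xE000;

        static String bytesToPua(byte[] raw) {
            StringBuilder sb = new StringBuilder(raw.length);
            for (byte b : raw) {
                sb.append((char) (PUA_BASE + (b & 0xFF)));
            }
            return sb.toString();
        }

        static byte[] puaToBytes(String s) {
            byte[] raw = new byte[s.length()];
            for (int i = 0; i < s.length(); i++) {
                raw[i] = (byte) (s.charAt(i) - PUA_BASE);
            }
            return raw;
        }

        public static void main(String[] args) {
            byte[] raw = { 'a', (byte) 0xBF, 'b' };
            String mapped = bytesToPua(raw);

            // The inconvenience noted above: the pattern is written against
            // U+E080-U+E0FF rather than against the byte values 0x80-0xFF.
            Pattern highBit = Pattern.compile("[\uE080-\uE0FF]");
            System.out.println(highBit.matcher(mapped).find());                   // true
            System.out.println(java.util.Arrays.equals(raw, puaToBytes(mapped))); // true
        }
    }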
