On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > Again, software which does not handle corner cases correctly is crap.
>
> Why are you confusing "special-case" with "corner case"?
>
> I never said that software shouldn't handle corner cases such as
> illegal UTF-8 sequences.
>
> I meant that an editor that handles illegal UTF-8 sequences other than
> by simply rejecting the edit request is a bit of a special case
> compared to general-purpose software, say an XML processor, for which
> some specification requires (or recommends?) that the processor ignore
> or reject any illegal sequences. The software isn't failing to handle
> the corner case; it is handling it--by explicitly rejecting it.
It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it contains
   invalid sequences.
2. The editor either silently clobbers them, or presents some sort of
   warning (which, as a newbie, I will skip past as quickly as I can)
   and then clobbers them.
3. I save the file, and suddenly I've irreversibly destroyed huge
   amounts of data.

It's simply not acceptable for opening a file and resaving it to yield
anything other than the same, byte-for-byte identical file, because that
can lead either to horrible data corruption or to an inability to edit a
file that has somehow gotten malformed data into it. If your editor
corrupts files like this, it's broken and I would never even consider
using it.

As an example of broken behavior (though different from what you're
talking about, since it's not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads a file. It proceeds to clobber
all Unicode characters which don't also exist in legacy mule character
sets, and upon saving, the file is horribly destroyed. Yes, this
situation is different, but the only difference is that UTF-8 is a
proper standard and mule is a horrible hack. The clobbering is just as
wrong either way. (I'm hoping the XEmacs developers will fix this
someday soon, since I otherwise love XEmacs, but this is pretty much a
show-stopper since it clobbers characters I actually use.)

> What I meant (given the quoted part below you replied to before) was
> that if you're dealing with a file that overall isn't valid UTF-8, how
> would you know whether a particular part that looks like valid UTF-8,
> representing some characters per the UTF-8 interpretation, really
> represents those characters or is an erroneously mixed-in
> representation of other characters in some other encoding?
>
> Since you're talking about preserving what's there as opposed to doing
> anything more than that, I would guess your answer is that it really
> doesn't matter.
> (Whether you treated 0xCF 0xBF as a correct UTF-8 sequence and
> displayed the character U+03FF or, hypothetically, treated it as an
> incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and displayed
> those characters, you'd still write the same bytes back out.)

Yes, that's exactly my answer. You might as well show it as the
character, in case it really was supposed to be the character. Now it
sounds like we at least understand what one another is saying.

> > > For example, if at one point you see the UTF-8-illegal byte
> > > sequence 0x00 0xBF and assume that that 0xBF byte means character
> > > U+00BF, then
> >
> > This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
>
> You said you're talking about a text editor, that reads bytes,
> displays legal UTF-8 sequences as the characters they represent in
> UTF-8, doesn't reject other UTF-8-illegal bytes, and does something
> with those bytes.
>
> What does it do with such a byte? It seems you were talking about
> mapping it to some character to display it. Are you talking about
> something else, such as displaying the hex value of the byte?

Yes. Actually, GNU Emacs displays octal instead of hex, but it's the
same idea. The pager "less" displays hex, such as <BF>, in reverse
video, and shows legal sequences that make up illegal or unprintable
codepoints in the form <U+D800> (also in reverse video).

> Yes, someone did--they wrote about rejecting spam mail by detecting
> bytes/octets with the high bit set.

Oh, that was me. I misunderstood what you meant, sorry.

> > If you're going to do this, at least map into the PUA rather than to
> > Latin-1..... At least that way it's clear what the meaning is.
>
> That makes it a bit less convenient, since then the numeric values of
> the characters don't match the numeric values of the bytes.
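The display-plus-preserve behavior described above can be sketched in a few lines of Python using the `surrogateescape` error handler; this is a minimal illustration of the idea, not how less or Emacs actually implement it:

```python
def render(raw: bytes) -> str:
    """Show valid UTF-8 as characters and each invalid byte as <XX>."""
    # surrogateescape smuggles each undecodable byte 0xYY through as
    # the lone surrogate U+DCYY instead of discarding or mangling it.
    text = raw.decode("utf-8", errors="surrogateescape")
    parts = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:        # a preserved invalid byte
            parts.append("<%02X>" % (cp & 0xFF))
        else:
            parts.append(ch)
    return "".join(parts)

raw = b"caf\xc3\xa9 \xbf!"     # valid UTF-8 for "café", then a stray 0xBF
print(render(raw))             # café <BF>!

# Crucially, re-encoding with the same handler restores the original
# bytes exactly, so load-then-save is byte-for-byte lossless:
assert raw.decode("utf-8", "surrogateescape") \
          .encode("utf-8", "surrogateescape") == raw
```

The point of the sketch is the last assertion: the editor can show whatever it likes for an invalid byte, as long as the byte itself survives the round trip.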
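On the Latin-1-versus-PUA point: the convenience at stake is that Latin-1 maps byte 0xNN to code point U+00NN one-to-one, so character-level tools (such as a regex engine that only operates on strings) can be used on raw bytes. A hedged Python sketch of both mappings (the discussion concerned Java, but the trick is the same; the U+F700 PUA offset here is an arbitrary choice for illustration):

```python
import re

raw = b"\x00header\xff\xd8payload"

# Latin-1 is a lossless 1:1 byte <-> code point mapping, so a text
# regex can match arbitrary byte patterns.
s = raw.decode("latin-1")
assert re.search("\xff\xd8", s) is not None   # finds the bytes FF D8
assert s.encode("latin-1") == raw             # round-trips exactly

# Mapping high bytes into the PUA instead (here U+F700 + byte) is
# unambiguous -- PUA code points can't be mistaken for real Latin-1
# text -- but the character values no longer equal the byte values.
pua = "".join(chr(0xF700 + b) if b >= 0x80 else chr(b) for b in raw)
back = bytes(ord(c) - 0xF700 if ord(c) >= 0xF700 else ord(c)
             for c in pua)
assert back == raw
```

Either mapping round-trips; the trade-off is exactly the one stated above: Latin-1 keeps byte and character values numerically equal, the PUA keeps the meaning unambiguous.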
> But yes, doing all that is not something you'd want to escape into the
> wild (be seen outside the immediate code where you need to fake
> byte-level regular expressions in Java).

*nod*

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/