On Sat, Mar 31, 2007 at 07:44:39PM -0400, Daniel B. wrote:
> Rich Felker wrote:
> > Again, software which does not handle corner cases correctly is crap.
>
> Why are you confusing "special-case" with "corner case"?
>
> I never said that software shouldn't handle corner cases such as
> illegal UTF-8 sequences.
>
> I meant that an editor that handles illegal UTF-8 sequences other than
> by simply rejecting the edit request is a bit of a special case
> compared to general-purpose software, say an XML processor, for which
> some specification requires (or recommends?) that the processor ignore
> or reject any illegal sequences. The software isn't failing to handle
> the corner case; it is handling it--by explicitly rejecting it.
It is a corner case! Imagine a situation like this:

1. I open a file in my text editor for editing, unaware that it contains
   invalid sequences.
2. The editor either silently clobbers them, or presents some sort of
   warning (which, as a newbie, I will skip past as quickly as I can)
   and then clobbers them.
3. I save the file, and suddenly I've irreversibly destroyed huge
   amounts of data.

It's simply not acceptable for opening a file and resaving it to yield
anything other than the same, byte-for-byte identical file, because that
can lead either to horrible data corruption or to an inability to edit a
file that has somehow gotten malformed data into it. If your editor
corrupts files like this, it's broken and I would never even consider
using it.

As an example of broken behavior (though different from what you're
talking about, since it's not UTF-8), XEmacs converts all characters to
its own nasty mule encoding when it loads a file. It proceeds to clobber
all Unicode characters which don't also exist in legacy mule character
sets, and upon saving, the file is horribly destroyed. Yes, this
situation is different, but the only difference is that UTF-8 is a
proper standard and mule is a horrible hack. The clobbering is just as
wrong either way. (I'm hoping the XEmacs developers will fix this
someday soon, since I otherwise love XEmacs, but this is pretty much a
show-stopper since it clobbers characters I actually use.)

> What I meant (given the quoted part below you replied to before) was
> that if you're dealing with a file that overall isn't valid UTF-8, how
> would you know whether a particular part that looks like valid UTF-8,
> representing some characters per the UTF-8 interpretation, really
> represents those characters or is an erroneously mixed-in
> representation of other characters in some other encoding?
>
> Since you're talking about preserving what's there as opposed to doing
> anything more than that, I would guess your answer is that it really
> doesn't matter.
> (Whether you treated 0xCF 0xBF as a correct UTF-8 sequence and
> displayed the character U+03FF or, hypothetically, treated it as an
> incorrectly-inserted Latin-1 encoding of U+00DF U+00BF and displayed
> those characters, you'd still write the same bytes back out.)

Yes, that's exactly my answer. You might as well show it as the
character, in case it really was supposed to be the character. Now it
sounds like we at least understand what one another is saying.

> > > For example, if at one point you see the UTF-8-illegal byte
> > > sequence 0x00 0xBF and assume that that 0xBF byte means character
> > > U+00BF, then
> >
> > This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
>
> You said you're talking about a text editor, that reads bytes,
> displays legal UTF-8 sequences as the characters they represent in
> UTF-8, doesn't reject other UTF-8-illegal bytes, and does something
> with those bytes.
>
> What does it do with such a byte? It seems you were talking about
> mapping it to some character to display it. Are you talking about
> something else, such as displaying the hex value of the byte?

Yes. Actually, GNU Emacs displays octal instead of hex, but it's the
same idea. The pager "less" displays hex, such as <BF>, in reverse
video, and shows legal sequences that make up illegal or unprintable
codepoints in the form <U+D800> (also in reverse video).

> Yes, someone did--they wrote about rejecting spam mail by detecting
> bytes/octets with the high bit set.

Oh, that was me. I misunderstood what you meant, sorry.

> > If you're going to do this, at least map into the PUA rather than to
> > Latin-1..... At least that way it's clear what the meaning is.
>
> That makes it a bit less convenient, since then the numeric values of
> the characters don't match the numeric values of the bytes.
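The display-plus-preserve behavior described above can be sketched in a few lines of Python using the `surrogateescape` error handler; this is a minimal illustration of the idea, not how less or Emacs actually implement it:

```python
def render(raw: bytes) -> str:
    """Show valid UTF-8 as characters and each invalid byte as <XX>."""
    # surrogateescape smuggles each undecodable byte 0xYY through as
    # the lone surrogate U+DCYY instead of discarding or mangling it.
    text = raw.decode("utf-8", errors="surrogateescape")
    parts = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:        # a preserved invalid byte
            parts.append("<%02X>" % (cp & 0xFF))
        else:
            parts.append(ch)
    return "".join(parts)

raw = b"caf\xc3\xa9 \xbf!"     # valid UTF-8 for "café", then a stray 0xBF
print(render(raw))             # café <BF>!

# Crucially, re-encoding with the same handler restores the original
# bytes exactly, so load-then-save is byte-for-byte lossless:
assert raw.decode("utf-8", "surrogateescape") \
          .encode("utf-8", "surrogateescape") == raw
```

The point of the sketch is the last assertion: the editor can show whatever it likes for an invalid byte, as long as the byte itself survives the round trip.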
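On the Latin-1-versus-PUA point: the convenience at stake is that Latin-1 maps byte 0xNN to code point U+00NN one-to-one, so character-level tools (such as a regex engine that only operates on strings) can be used on raw bytes. A hedged Python sketch of both mappings (the discussion concerned Java, but the trick is the same; the U+F700 PUA offset here is an arbitrary choice for illustration):

```python
import re

raw = b"\x00header\xff\xd8payload"

# Latin-1 is a lossless 1:1 byte <-> code point mapping, so a text
# regex can match arbitrary byte patterns.
s = raw.decode("latin-1")
assert re.search("\xff\xd8", s) is not None   # finds the bytes FF D8
assert s.encode("latin-1") == raw             # round-trips exactly

# Mapping high bytes into the PUA instead (here U+F700 + byte) is
# unambiguous -- PUA code points can't be mistaken for real Latin-1
# text -- but the character values no longer equal the byte values.
pua = "".join(chr(0xF700 + b) if b >= 0x80 else chr(b) for b in raw)
back = bytes(ord(c) - 0xF700 if ord(c) >= 0xF700 else ord(c)
             for c in pua)
assert back == raw
```

Either mapping round-trips; the trade-off is exactly the one stated above: Latin-1 keeps byte and character values numerically equal, the PUA keeps the meaning unambiguous.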
> But yes, doing all that is not something you'd want to escape into the
> wild (be seen outside the immediate code where you need to fake
> byte-level regular expressions in Java).

*nod*

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/