On Wed, Mar 28, 2007 at 10:39:49PM -0400, Daniel B. wrote:
> > > Well of course you need to think in bytes when you're interpreting the
> > > stream of bytes as a stream of characters, which includes checking for
> > > invalid UTF-8 sequences.
> > 
> > And what do you do if they're present? 
> 
> Of course, it depends where they are present.  You seem to be addressing
> relative special cases.

I’m addressing corner cases. Robust systems engineering is ALWAYS
about handling the corner cases. Any stupid codemonkey can write code
that does what’s expected when you throw the expected input at it. The
problem is that code written that way blows up and hands your attacker
root as soon as they throw something unexpected at it. :)

> > Under your philosophy, it would
> > be impossible for me to remove files with invalid sequences in their
> > names, since I could neither type the filename nor match it with glob
> > patterns (due to the filename causing an error at the byte to
> > character conversion phase before there’s even a chance to match
> > anything). ...
> 
> If the file name contains illegal byte sequences, then either they’re
> not in UTF-8 to start with or, if they’re supposed to be, something
> else let invalid sequences through.

Several likely scenarios:

1. An attacker intentionally created invalid filenames. This might
   just be annoying vandalism, but it might also be an attempt to
   trick non-robust code into doing something bad: throwing away or
   replacing the invalid sequences so that the name collides with
   another filename, interpreting overlong UTF-8 sequences, etc.
   (see the sketch after this list).

2. A foolish user copied files from a foreign system (e.g. via scp or
   rsync) whose names are in a different encoding, without conversion.

3. A user (yourself or someone else) extracted files from a tar or
   zip archive with names in a foreign encoding, without using
   software that could detect and correct the situation.
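
To make the overlong-sequence point in case 1 concrete, here is a
minimal sketch in standard C (my own illustration, not code from any
tool discussed here): a naive two-byte decoder happily turns the
illegal sequence 0xC0 0xAF into '/', which is exactly the kind of
thing that defeats path checks, while a strict decoder rejects it.

    #include <stdio.h>

    /* decode a 2-byte UTF-8 sequence; a strict decoder must reject
       overlong forms (anything below U+0080 encoded in 2 bytes) */
    static int decode2(unsigned char b1, unsigned char b2, int strict)
    {
        if ((b1 & 0xE0) != 0xC0 || (b2 & 0xC0) != 0x80)
            return -1;                    /* not a 2-byte form at all */
        int cp = ((b1 & 0x1F) << 6) | (b2 & 0x3F);
        if (strict && cp < 0x80)
            return -1;                    /* overlong: reject */
        return cp;
    }

    int main(void)
    {
        printf("naive:  %d\n", decode2(0xC0, 0xAF, 0)); /* 47 == '/' */
        printf("strict: %d\n", decode2(0xC0, 0xAF, 1)); /* -1 */
        return 0;
    }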

> If they're supposed to be UTF-8 and aren't, then certainly normal
> tools shouldn't have to deal with malformed sequences.

This is nonsense. Regardless of what they’re supposed to be, someone
could intentionally or unintentionally create files whose names are
not valid UTF-8. While it would be a nice kernel feature to make such
filenames illegal, you have to consider foreign removable media (where
someone might already have created such bad names), and since POSIX
makes no guarantee that strings which are illegal sequences in the
character encoding are also illegal as filenames, any robust and
portable code MUST account for the fact that they can exist. Thus
filenames, command lines, etc. MUST always be handled as bytes, or in
a way that preserves invalid sequences.
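
As an illustration of what “handled as bytes” means in practice, here
is a rough sketch (standard POSIX C, my own example): list a
directory, treat every name as an opaque byte string, and merely
check whether each one happens to be valid UTF-8, without converting,
dropping, or replacing anything.

    #include <dirent.h>
    #include <stdio.h>

    /* rough UTF-8 validity check on an opaque byte string; note it
       still accepts a few sequences (surrogates, some overlongs in
       3/4-byte forms) that a fully strict validator would reject */
    static int is_utf8(const unsigned char *s)
    {
        while (*s) {
            int n;
            unsigned c = *s;
            if (c < 0x80) { s++; continue; }
            else if ((c & 0xE0) == 0xC0 && c >= 0xC2) n = 1;
            else if ((c & 0xF0) == 0xE0) n = 2;
            else if ((c & 0xF8) == 0xF0 && c <= 0xF4) n = 3;
            else return 0;
            for (s++; n--; s++)
                if ((*s & 0xC0) != 0x80) return 0;
        }
        return 1;
    }

    int main(void)
    {
        DIR *d = opendir(".");
        struct dirent *e;
        if (!d) return 1;
        while ((e = readdir(d)))
            if (!is_utf8((unsigned char *)e->d_name))
                printf("not valid UTF-8: %s\n", e->d_name);
        closedir(d);
        return 0;
    }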

> If you write
> a special tool to fix malformed sequences somehow (e.g., delete files
> with malformed sequences), then of course you're going to be dealing
> with the byte level and not (just) the character level.

Why should I need a special tool to do this?? Something like:
    rm *known_ascii_substring*
should work, as long as the filename contains a unique ASCII (or valid
UTF-8) substring.
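
The reason this should work is that matching a known ASCII substring
requires no character-level decoding at all; byte-level matching is
enough, and it keeps working even when the rest of the name is not
valid UTF-8. A tiny sketch (illustration only, with a made-up
filename; a real shell’s globbing is of course more involved):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* hypothetical filename containing an invalid UTF-8 byte */
        const char *name = "report-\xFF-known_ascii_substring.txt";
        if (strstr(name, "known_ascii_substring"))
            printf("matched: rm *known_ascii_substring* would get it\n");
        return 0;
    }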

> > Other similar problem: I open a file in a text editor and it contains
> > illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
> 
> Again, you seem to be dealing with special cases.

Again, software which does not handle corner cases correctly is crap.

> If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
> would you expect a UTF-8 text editor to work on it? 

I expect my text editor to be able to edit any file without corrupting
it. Perhaps you have lower expectations... If you’re used to Windows
Notepad, that would be natural, but I’m used to GNU Emacs.

> For the data that is parseable as a valid UTF-8 encoding of characters, 
> how do you propose to know whether it really is characters encoded as
> UTF-8 or is characters encoded some other way?   

It’s neither. It’s bytes, which, when presented for editing, are
displayed as characters according to their interpretation as UTF-8. :)

If I receive a foreign file in a legacy encoding and wish to interpret
it as characters in that encoding, then I’ll convert it to UTF-8 with
iconv (which deals with bytes) or use the C-x RET c prefix in Emacs to
visit the file with a particular encoding. What I absolutely do NOT
want is for a file to “magically” be interpreted as Latin-1 or some
other legacy codepage as soon as invalid sequences are detected. That
clobbers my system’s ability to edit its own native data for the sake
of accommodating foreign data.

I respect that others do want and regularly use such auto-detection
functionality, however.

> (If you see the byte sequence 0xDF 0xBF, how do you know whether that 
> means the character U+003FF

It never means U+03FF in any case, because U+03FF is encoded as
0xCF 0xBF; the sequence 0xDF 0xBF would decode to U+07FF...
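
For reference, the two-byte arithmetic, as a trivial sketch
(illustration only): 110xxxxx 10yyyyyy encodes the codepoint
xxxxxyyyyyy, so:

    #include <stdio.h>

    int main(void)
    {
        unsigned cf_bf = ((0xCF & 0x1F) << 6) | (0xBF & 0x3F);
        unsigned df_bf = ((0xDF & 0x1F) << 6) | (0xBF & 0x3F);
        printf("0xCF 0xBF -> U+%04X\n", cf_bf);   /* U+03FF */
        printf("0xDF 0xBF -> U+%04X\n", df_bf);   /* U+07FF */
        return 0;
    }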

> or the two characters U+00DF U+00BF?  For

It never means this in text on my system because the text encoding is
UTF-8. It would mean this only if your local character encoding were
Latin-1.

> example, if at one point you see the UTF-8-illegal byte sequence
> 0x00 0xBF and assume that that 0xBF byte means character U+00BF, then

This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
Assuming Latin-1 as soon as an illegal sequence is detected is
sometimes a useful hack, e.g. on IRC when some people are too stubborn
or uneducated to fix their encoding, but it’s fundamentally incorrect,
and in most cases will do more harm than good in the long term. IRC
is a notable exception because the text is separated into individual
messages and you can selectively interpret individual messages in
legacy encodings without compromising the ability to accept valid
UTF-8.

> Either edit the test file with a byte editor, or edit it with a text 
> editor specifying an encoding that maps bytes to characters (e.g., 
> ISO 8859-1), even if those characters aren't the same characters the
> UTF-8-valid parts of the file represent.

With a byte editor, how am I supposed to see any characters except
ASCII correctly? Do all the UTF-8 math in my head and then look the
results up in a table?!

> > or a file with mixed encodings (e.g. a mail spool) or with mixed-in
> > binary data. I want to edit it anyway and save it back without
> > trashing the data that does not parse as valid UTF-8, while still
> > being able to edit the data that is valid UTF-8 as UTF-8.
> 
> If the file uses mixed encodings, then of course you can't read the
> entire file in one encoding.  

Indeed. But I’m thinking of cases like:
    cat file1.utf8 file2.latin1 file3.utf8 > foobar

Obviously this should not be done, but people are sometimes ignorant,
so such files do come to exist and eventually arrive on my system. :)
What you’re saying is that I should have to use a ‘byte editor’ (what
is that? a hex editor?) to repair this file, rather than just being
able to load it in Emacs like any other file and edit it. That strikes
me as unnecessarily crippling.
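
For what it’s worth, even finding the damage in such a file is a
purely byte-level job. A rough sketch (POSIX iconv(3), reusing the
foobar name from the example above): run the data through a
UTF-8-to-UTF-8 conversion purely as a validity check, report the
offset of each invalid sequence, and leave the bytes themselves alone.

    #include <errno.h>
    #include <iconv.h>
    #include <stdio.h>

    int main(void)
    {
        static char buf[1 << 20];          /* sketch: first 1 MiB only */
        FILE *f = fopen("foobar", "rb");
        if (!f) return 1;
        size_t len = fread(buf, 1, sizeof buf, f);
        fclose(f);

        /* UTF-8 -> UTF-8 "conversion" used purely as a validity check */
        iconv_t cd = iconv_open("UTF-8", "UTF-8");
        if (cd == (iconv_t)-1) return 1;

        char *in = buf, out[4096];
        size_t inleft = len;
        while (inleft) {
            char *op = out;
            size_t outleft = sizeof out;
            if (iconv(cd, &in, &inleft, &op, &outleft) == (size_t)-1) {
                if (errno == EILSEQ) {
                    printf("invalid sequence at byte offset %zu\n",
                           (size_t)(in - buf));
                    in++; inleft--;        /* skip one byte, resume */
                } else if (errno == EINVAL) {
                    printf("truncated sequence at byte offset %zu\n",
                           (size_t)(in - buf));
                    break;
                }
                /* E2BIG: output buffer full, just loop again */
            }
        }
        iconv_close(cd);
        return 0;
    }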

> But when you determine that a given section (e.g., a MIME part) is
> in some given encoding, why not map that section's bytes to characters
> and then work with characters from then on?

Of course that’s preferred, but requires specialized tools. The core
strength of Unix is being able to use general tools for many purposes
beyond what they were originally intended for.

> > > > Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will
> > 
> > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
> > instill faith...
> 
> Maybe you should think more clearly.  I didn't write my mailer, so the
> quality of its behavior doesn't reflect my knowledge.   

If you don’t actually use UTF-8, I think that reflects your lack of
qualification to talk about issues related to it. And I don’t see how
you could be using it if your mailer won’t even handle it...

> > > Of course, you can still do that with character-based strings if you
> > > can use other encodings.  (E.g., in Java, you can read the mail
> > > as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
> > > Then you can write the regular expression in terms of Unicode characters
> > > 0-255.  The only disadvantage there is probably some time spent
> > > decoding the byte stream into the internal representation of characters.)
> > 
> > The biggest disadvantage of it is that it's WRONG. 
> 
> Is it any more wrong than your rejecting of 1xxxxxx bytes?  The bytes
> represent characters in some encoding.  You ignore those characters and
> reject based on just the byte values.  

Nobody said that it needs to be “rejected” (except Unicode fools who
think in terms of UTF-16...). It’s just not a character, but some
other binary data with no interpretation as a character.

> > The data is not
> > Latin-1, and pretending it's Latin-1 is a hideous hack. 
> 
> It's not pretending the data is bytes encoding characters.  It's mapping
> bytes to characters to use methods defined on characters.  Yes, it could
> be misleading if it's not clear that it's a temporary mapping only for
> that purpose (i.e., that the mapped-to characters are not the characters
> that the byte sequence really represents).  And yes, byte-based regular
> expressions would be useful.

If you’re going to do this, at least map into the PUA rather than to
Latin-1... That way it’s at least clear what the mapping means.
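
To be clear about what I mean, something like the following sketch
(an entirely hypothetical mapping, U+E000 plus the byte value; the
real point is just “private use”, not this particular offset): an
undecodable byte becomes a PUA codepoint instead of masquerading as a
Latin-1 character. It’s only unambiguous if genuine U+E000..U+E0FF in
the input are escaped some other way.

    #include <stdio.h>

    /* encode one codepoint <= U+FFFF as UTF-8 into out[], return length */
    static int put_utf8(unsigned cp, unsigned char *out)
    {
        if (cp < 0x80) { out[0] = cp; return 1; }
        if (cp < 0x800) {
            out[0] = 0xC0 | (cp >> 6);
            out[1] = 0x80 | (cp & 0x3F);
            return 2;
        }
        out[0] = 0xE0 | (cp >> 12);
        out[1] = 0x80 | ((cp >> 6) & 0x3F);
        out[2] = 0x80 | (cp & 0x3F);
        return 3;
    }

    int main(void)
    {
        unsigned char bad = 0xBF;            /* stray continuation byte */
        unsigned char seq[4];
        int n = put_utf8(0xE000 + bad, seq); /* -> U+E0BF, not U+00BF */
        for (int i = 0; i < n; i++)
            printf("%02X ", seq[i]);
        putchar('\n');                       /* prints: EE 82 BF */
        return 0;
    }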

〜Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
