On Wed, Mar 28, 2007 at 10:39:49PM -0400, Daniel B. wrote:
> > > Well of course you need to think in bytes when you're interpreting the
> > > stream of bytes as a stream of characters, which includes checking for
> > > invalid UTF-8 sequences.
> >
> > And what do you do if they're present?
>
> Of course, it depends where they are present. You seem to be addressing
> relative special cases.
I'm addressing corner cases. Robust systems engineering is ALWAYS about
handling the corner cases. Any stupid codemonkey can write code that does
what's expected when you throw the expected input at it. The problem is
that coding like this blows up and gives your attacker root as soon as
they throw something unexpected at it. :)

> > Under your philosophy, it would
> > be impossible for me to remove files with invalid sequences in their
> > names, since I could neither type the filename nor match it with glob
> > patterns (due to the filename causing an error at the byte to
> > character conversion phase before there's even a chance to match
> > anything). ...
>
> If the file name contains illegal byte sequences, then either they're
> not in UTF-8 to start with or, if they're supposed to be, something
> else let invalid sequences through.

Several likely scenarios:

1. Attacker intentionally created invalid filenames. This might just be
   annoying vandalism, but on the other hand it might be trying to trick
   non-robust code into doing something bad (maybe throwing away or
   replacing the invalid sequences so that the name collides with another
   filename, or interpreting overlong UTF-8 sequences, etc.).

2. Foolish user copied filenames from a foreign system (e.g. scp or
   rsync) with a different encoding, without conversion.

3. User (yourself or other) extracted files from a tar or zip archive
   with names encoded in a foreign encoding, without using software that
   could detect and correct the situation.

> If they're supposed to be UTF-8 and aren't, then certainly normal
> tools shouldn't have to deal with malformed sequences.

This is nonsense. Regardless of what they're supposed to be, someone
could intentionally or unintentionally create files whose names are not
valid UTF-8. While it would be a nice kernel feature to make such
filenames illegal, you have to consider foreign removable media (where
someone might have already created such bad names), and since POSIX makes
no guarantee that strings which are illegal sequences in the character
encoding are illegal as filenames, any robust and portable code MUST
account for the fact that they could exist. Thus filenames, command
lines, etc. MUST always be handled as bytes, or in a way that preserves
invalid sequences.

> If you write
> a special tool to fix malformed sequences somehow (e.g., delete files
> with malformed sequences), then of course you're going to be dealing
> with the byte level and not (just) the character level.

Why should I need a special tool to do this?? Something like:

  rm *known_ascii_substring*

should work, as long as the filename contains a unique ASCII (or valid
UTF-8) substring.

> > Other similar problem: I open a file in a text editor and it contains
> > illegal sequences. For example, Markus Kuhn's UTF-8 decoder test file,
>
> Again, you seem to be dealing with special cases.

Again, software which does not handle corner cases correctly is crap.

> If a UTF-8 decoder test file contains illegal byte UTF-8 sequences, why
> would you expect a UTF-8 text editor to work on it?

I expect my text editor to be able to edit any file without corrupting
it. Perhaps you have lower expectations... If you're used to Windows
Notepad, that would be natural, but I'm used to GNU Emacs.

> For the data that is parseable as a valid UTF-8 encoding of characters,
> how do you propose to know whether it really is characters encoded as
> UTF-8 or is characters encoded some other way?

It's neither.
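(Before going on: a rough sketch of the kind of handling I have in mind,
not code from any particular editor or tool. Decode whatever forms a
valid UTF-8 sequence, rejecting overlong forms, surrogates, and anything
above U+10FFFF; pass every other byte through untouched instead of
discarding it or guessing a legacy encoding. The output format here is
just for demonstration.)

#include <stdio.h>
#include <stddef.h>

/* Try to decode one UTF-8 sequence from s[0..n-1].  On success, store the
 * code point in *cp and return the sequence length in bytes (1-4).  On
 * failure (stray continuation byte, bad lead byte, truncated sequence,
 * overlong form, surrogate, or value above U+10FFFF), return 0 so the
 * caller can treat s[0] as an uninterpreted raw byte. */
static int decode_utf8(const unsigned char *s, size_t n, unsigned long *cp)
{
    static const unsigned long minval[] = { 0, 0, 0x80, 0x800, 0x10000 };
    unsigned long c;
    int len, i;

    if (n == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    if ((s[0] & 0xE0) == 0xC0) { c = s[0] & 0x1F; len = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { c = s[0] & 0x0F; len = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { c = s[0] & 0x07; len = 4; }
    else return 0;                      /* 0x80-0xBF or 0xF8-0xFF lead byte */

    if ((size_t)len > n) return 0;      /* truncated sequence */
    for (i = 1; i < len; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        c = (c << 6) | (s[i] & 0x3F);
    }
    if (c < minval[len]) return 0;             /* overlong encoding */
    if (c >= 0xD800 && c <= 0xDFFF) return 0;  /* UTF-16 surrogate range */
    if (c > 0x10FFFF) return 0;
    *cp = c;
    return len;
}

int main(void)
{
    /* a-umlaut (0xC3 0xA4) followed by a stray continuation byte 0xBF:
     * the first two bytes decode to a character, the third stays a byte. */
    const unsigned char buf[] = { 0xC3, 0xA4, 0xBF };
    size_t i = 0, n = sizeof buf;

    while (i < n) {
        unsigned long c;
        int len = decode_utf8(buf + i, n - i, &c);
        if (len) { printf("U+%04lX\n", c); i += len; }
        else     { printf("raw byte 0x%02X\n", buf[i]); i += 1; }
    }
    return 0;
}

With those three bytes this prints U+00E4 and then the raw byte 0xBF;
nothing is thrown away and nothing is silently reinterpreted.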
It's bytes, which, when they are presented for editing, are displayed as
a character according to their interpretation as UTF-8. :) If I receive a
foreign file in a legacy encoding and wish to interpret it as characters
in that encoding, then I'll convert it to UTF-8 with iconv (which deals
with bytes), or use the C-x RET c prefix in Emacs to visit the file with
a particular encoding. What I absolutely do NOT want is for a file to
"magically" be interpreted as Latin-1 or some other legacy codepage as
soon as invalid sequences are detected. This is clobbering the
functionality of my system to edit its own native data for the sake of
accommodating foreign data. I respect that others do want and regularly
use such auto-detection functionality, however.

> (If you see the byte sequence 0xDF 0xBF, how do you know whether that
> means the character U+003FF

It never means U+03FF in any case, because U+03FF is 0xCF 0xBF (0xDF 0xBF
decodes to U+07FF)...

> or the two characters U+00DF U+00BF? For

It never means this in text on my system, because the text encoding is
UTF-8. It would mean this only if your local character encoding were
Latin-1.

> example, if at one point you see the UTF-8-illegal byte sequence
> 0x00 0xBF and assume that that 0xBF byte means character U+00BF, then

This is incorrect. It means the byte 0xBF, and NOT ANY CHARACTER.
Assuming Latin-1 as soon as an illegal sequence is detected is sometimes
a useful hack, e.g. on IRC when some people are too stubborn or
uneducated to fix their encoding, but it's fundamentally incorrect, and
in most cases it will cause more harm than help in the long term. IRC is
a notable exception because the text is separated into individual
messages, so you can selectively interpret individual messages in legacy
encodings without compromising the ability to accept valid UTF-8.

> Either edit the test file with a byte editor, or edit it with a text
> editor specifying an encoding that maps bytes to characters (e.g.,
> ISO 8859-1), even if those characters aren't the same characters the
> UTF-8-valid parts of the file represent.

With a byte editor, how am I supposed to see any characters except ASCII
correctly? Do all the UTF-8 math in my head and then look them up in a
table?!?!

> > or a file with mixed encodings (e.g. a mail spool) or with mixed-in
> > binary data. I want to edit it anyway and save it back without
> > trashing the data that does not parse as valid UTF-8, while still
> > being able to edit the data that is valid UTF-8 as UTF-8.
>
> If the file uses mixed encodings, then of course you can't read the
> entire file in one encoding.

Indeed. But I'm thinking of cases like:

  cat file1.utf8 file2.latin1 file3.utf8 > foobar

Obviously this should not be done, but sometimes people are ignorant, and
thus sometimes such files come to exist and eventually arrive on my
system. :) What you're saying is that I should have to use a "byte
editor" (what is that? a hex editor?) to repair this file, rather than
just being able to load it like any other file in Emacs and edit it. This
strikes me as unnecessarily crippling.

> But when you determine that a given section (e.g., a MIME part) is
> in some given encoding, why not map that section's bytes to characters
> and then work with characters from then on?

Of course that's preferred, but it requires specialized tools. The core
strength of Unix is being able to use general tools for many purposes
beyond what they were originally intended for.

> > > > Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will
> >
> > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not
> > instill faith...
>
> Maybe you should think more clearly. I didn't write my mailer, so the
> quality of its behavior doesn't reflect my knowledge.

If you don't actually use UTF-8, I think that reflects your lack of
qualification to talk about issues related to it. And I don't see how you
could be using it if your mailer won't even handle it...

> > > Of course, you can still do that with character-based strings if you
> > > can use other encodings. (E.g., in Java, you can read the mail
> > > as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255.
> > > Then you can write the regular expression in terms of Unicode
> > > characters 0-255. The only disadvantage there is probably some time
> > > spent decoding the byte stream into the internal representation of
> > > characters.)
> >
> > The biggest disadvantage of it is that it's WRONG.
>
> Is it any more wrong than your rejecting of 1xxxxxx bytes? The bytes
> represent characters in some encoding. You ignore those characters and
> reject based on just the byte values.

Nobody said that it needs to be "rejected" (except Unicode fools who
think in terms of UTF-16...). It's just not a character, but some other
binary data with no interpretation as a character.

> > The data is not
> > Latin-1, and pretending it's Latin-1 is a hideous hack.
>
> It's not pretending the data is bytes encoding characters. It's mapping
> bytes to characters to use methods defined on characters. Yes, it could
> be misleading if it's not clear that it's a temporary mapping only for
> that purpose (i.e., that the mapped-to characters are not the characters
> that the byte sequence really represents). And yes, byte-based regular
> expressions would be useful.

If you're going to do this, at least map into the PUA rather than to
Latin-1..... At least that way it's clear what the meaning is.

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/