Rich Felker wrote:
> On Tue, Mar 27, 2007 at 10:55:32PM -0400, Daniel B. wrote:
> > Rich Felker wrote:
> > > ...
> > > None of this is relevant to most processing of text which is just
> > > storage, retrieval, concatenation, and exact substring search.
> >
> > It might be true that more-complicated processing is not relevant
> > to those operations.  (I'm not 100% sure about exact substring
> > matches, but maybe if the byte string given to search for is proper
> > (e.g., doesn't have any partial representations of characters),
> > it's okay.)
>
> No character is a substring of another character in UTF-8.
> ...

I know.  That's why I addressed avoiding _partial_ byte sequences.
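(To illustrate the point we apparently agree on -- this is just a
sketch of mine, with made-up names, not code from anywhere in this
thread: because UTF-8 lead bytes and continuation bytes (0x80-0xBF)
come from disjoint ranges, a byte-level search for a _complete_
character sequence can only match on a character boundary:

    public class Utf8SearchSketch {
        // Plain byte-level substring search; no UTF-8 decoding at all.
        static int indexOf(byte[] haystack, byte[] needle) {
            outer:
            for (int i = 0; i + needle.length <= haystack.length; i++) {
                for (int j = 0; j < needle.length; j++) {
                    if (haystack[i + j] != needle[j]) continue outer;
                }
                return i;  // can only be a character boundary
            }
            return -1;
        }

        public static void main(String[] args) throws Exception {
            // "é" is 0xC3 0xA9 in UTF-8; the full two-byte needle
            // can't match inside some other character's encoding,
            // because a continuation byte never begins a character.
            byte[] text = "café au lait".getBytes("UTF-8");
            byte[] needle = "é".getBytes("UTF-8");
            System.out.println(indexOf(text, needle));  // 3 (byte offset)
        }
    }

A needle holding a _partial_ sequence -- say the lone byte 0xA9 -- is
exactly what breaks this, which is all I meant.)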
> > Well of course you need to think in bytes when you're interpreting
> > the stream of bytes as a stream of characters, which includes
> > checking for invalid UTF-8 sequences.
>
> And what do you do if they're present?

Of course, it depends where they are present.  You seem to be
addressing relatively special cases.

> Under your philosophy, it would be impossible for me to remove files
> with invalid sequences in their names, since I could neither type
> the filename nor match it with glob patterns (due to the filename
> causing an error at the byte to character conversion phase before
> there's even a chance to match anything).
> ...

If the file name contains illegal byte sequences, then either they're
not in UTF-8 to start with or, if they're supposed to be, something
else let invalid sequences through.

If they're not always in UTF-8 (if they're sometimes in a different
encoding), then why would you be interpreting them as UTF-8 (why
would you hit the case where it seems there's an illegal sequence)?
(How do you know what encoding they're in?  Or are you dealing with
the problem of not having any specification of what the encoding
really is and having to guess?)

If they're supposed to be UTF-8 and aren't, then certainly normal
tools shouldn't have to deal with malformed sequences.  If you write
a special tool to fix malformed sequences somehow (e.g., delete files
with malformed sequences), then of course you're going to be dealing
with the byte level and not (just) the character level.

> Other similar problem: I open a file in a text editor and it
> contains illegal sequences. For example, Markus Kuhn's UTF-8
> decoder test file,

Again, you seem to be dealing with special cases.  If a UTF-8 decoder
test file contains illegal UTF-8 byte sequences, why would you expect
a UTF-8 text editor to work on it?

For the data that is parseable as a valid UTF-8 encoding of
characters, how do you propose to know whether it really is
characters encoded as UTF-8 or is characters encoded some other way?

(If you see the byte sequence 0xDF 0xBF, how do you know whether that
means the character U+07FF or the two characters U+00DF U+00BF?  For
example, if at one point you see the UTF-8-illegal byte sequence
0x00 0xBF and assume that that 0xBF byte means character U+00BF, then
if you see the UTF-8-legal byte sequence 0xDF 0xBF, how would you
know whether that 0xBF byte also represents U+00BF or is really part
of the representation of character U+07FF?)

Either edit the test file with a byte editor, or edit it with a text
editor specifying an encoding that maps bytes to characters (e.g.,
ISO 8859-1), even if those characters aren't the same characters the
UTF-8-valid parts of the file represent.

> or a file with mixed encodings (e.g. a mail spool) or with mixed-in
> binary data. I want to edit it anyway and save it back without
> trashing the data that does not parse as valid UTF-8, while still
> being able to edit the data that is valid UTF-8 as UTF-8.

If the file uses mixed encodings, then of course you can't read the
entire file in one encoding.  But when you determine that a given
section (e.g., a MIME part) is in some given encoding, why not map
that section's bytes to characters and then work with characters from
then on?

What if you're searching for a character string across multiple
sections of a mixed-encoding file like that?  You certainly can't
write a UTF-8-byte regular expression that matches other encodings.
And you can't "OR" together the regular expression for a UTF-8 byte
encoding of the character string with each other encoding's byte
sequences (since there's no way for the regular expression matcher to
know which alternative in the regular expression it should be using
in each section of the mixed-encoding file).  ...

> > > Hardly. A byte-based regex for all case matches (e.g. "(Ã¤|Ã„)")
> > > will
>
> The fact that your mailer misinterpreted my UTF-8 as Latin-1 does
> not instill faith...

Maybe you should think more clearly.  I didn't write my mailer, so
the quality of its behavior doesn't reflect my knowledge.

> > > ... Sometimes a byte-based regex is also useful. For example my
> > > procmail rules reject mail containing any 8bit octets if there's
> > > not an appropriate mime type for it. This kills a lot of east
> > > asian spam. :)
> >
> > Yep.
> >
> > Of course, you can still do that with character-based strings if
> > you can use other encodings.  (E.g., in Java, you can read the
> > mail as ISO-8859-1, which maps bytes 0-255 to Unicode characters
> > 0-255.  Then you can write the regular expression in terms of
> > Unicode characters 0-255.  The only disadvantage there is probably
> > some time spent decoding the byte stream into the internal
> > representation of characters.)
>
> The biggest disadvantage of it is that it's WRONG.

Is it any more wrong than your rejection of 1xxxxxxx bytes?  The
bytes represent characters in some encoding.  You ignore those
characters and reject based on just the byte values.

> The data is not Latin-1, and pretending it's Latin-1 is a hideous
> hack.

It's not pretending the bytes really are Latin-1 text.  It's mapping
bytes to characters in order to use methods defined on characters.

Yes, it could be misleading if it's not clear that it's a temporary
mapping only for that purpose (i.e., that the mapped-to characters
are not the characters that the byte sequence really represents).

And yes, byte-based regular expressions would be useful.
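(Here's a concrete sketch of that ISO-8859-1 mapping -- the class
name and sample bytes are mine, invented for illustration; the
pattern syntax is just standard java.util.regex:

    import java.util.Arrays;
    import java.util.regex.Pattern;

    public class ByteRegexSketch {
        public static void main(String[] args) throws Exception {
            // The 0xDF 0xBF pair from above, plus one ASCII byte.
            byte[] raw = { (byte) 0xDF, (byte) 0xBF, (byte) 'a' };

            // ISO-8859-1 maps bytes 0x00-0xFF one-to-one onto
            // U+0000-U+00FF, so this decode never fails and loses
            // nothing, whatever encoding the bytes are "really" in.
            String asChars = new String(raw, "ISO-8859-1");

            // A regex over characters U+0080-U+00FF now matches
            // exactly the 8-bit octets of the original byte stream
            // (the procmail-style "any 8bit octets?" test).
            boolean has8bit =
                Pattern.compile("[\\x80-\\xFF]").matcher(asChars).find();
            System.out.println(has8bit);  // true: 0xDF and 0xBF

            // The mapping is temporary and reversible; re-encoding
            // recovers the original bytes, so no data gets trashed.
            byte[] back = asChars.getBytes("ISO-8859-1");
            System.out.println(Arrays.equals(raw, back));  // true
        }
    }

The only cost is the decode/encode pass, as I said above.)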
And you can't "OR" together the regular expression for a UTF-8 byte encoding of the character string with each other encoding's byte sequences (since there's no way for the regular expression matcher to know which alternative in the regular expression it should be using in each section of the mixed-encoding file. ... > > > Hardly. A byte-based regex for all case matches (e.g. "(Ãf¤|Ãf?)") will > > The fact that your mailer misinterpreted my UTF-8 as Latin-1 does not > instill faith... Maybe you should think more clearly. I didn't write my mailer, so the quality of its behavior doesn't reflect my knowledge. > > >... Sometimes a byte-based regex is also useful. For > > > example my procmail rules reject mail containing any 8bit octets if > > > there's not an appropriate mime type for it. This kills a lot of east > > > asian spam. :) > > > > Yep. > > > > Of course, you can still do that with character-based strings if you > > can use other encodings. (E.g., in Java, you can read the mail > > as ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255. > > Then you can write the regular expression in terms of Unicode characters > > 0-255. The only disadvantage there is probably some time spent > > decoding the byte stream into the internal representation of characters.) > > The biggest disadvantage of it is that it's WRONG. Is it any more wrong than your rejecting of 1xxxxxx bytes? The bytes represent characters in some encoding. You ignore those characters and reject based on just the byte values. > The data is not > Latin-1, and pretending it's Latin-1 is a hideous hack. It's not pretending the data is bytes encoding characters. It's mapping bytes to characters to use methods defined on characters. Yes, it could be misleading if it's not clear that it's a temporary mapping only for that purpose (i.e., that the mapped-to characters are not the characters that the byte sequence really represents). And yes, byte-based regular expressions would be useful. > Maybe 20 years from now we'll finally be able to get rid of the > nonsense and just assume everything is UTF-8... Hopefully. Daniel -- Daniel Barclay [EMAIL PROTECTED] -- Linux-UTF8: i18n of Linux on all levels Archive: http://mail.nl.linux.org/linux-utf8/