Rich Felker wrote:
> ...
> None of this is relevant to most processing of text which is just
> storage, retrieval, concatenation, and exact substring search.

It might be true that more complicated processing is not relevant to those
operations. (I'm not 100% sure about exact substring matches, but if the
byte string being searched for is well-formed (e.g., contains no partial
representations of characters), it's probably okay.)

However, I think you're stretching things too much to say "most." (I guess
it depends on what we're calling "text processing.")

> > ... But in most of the cases you have to _think_ in
> > characters, otherwise it's quite unlikely that your application will work
> > correctly.
>
> It's the other way around, too: you have to think in terms of bytes.
> If you're thinking in terms of characters too much you'll end up doing
> noninvertable transformations and introduce vulnerabilities when data
> has been maliciously crafted not to be valid utf-8 (or just bugs due
> to normalizing data, etc.).

Well, of course you need to think in bytes when you're interpreting the
stream of bytes as a stream of characters, which includes checking for
invalid UTF-8 sequences. Once you've checked that the bytes properly
represent characters, from then on you need to think in characters unless
you're doing sufficiently simple operations (e.g., your list above).

> Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will
> work just as well even for case-insensitive matching, and literal
> character matching is simple substring matching identical to any other
> sane encoding. I get the impression you don't understand UTF-8..

How do you match a single character? Would you want the programmer to have
to write an expression that matches a byte 0x00 through 0x7F, a sequence of
two bytes from 0xC2 0x80 through 0xDF 0xBF, a sequence of three bytes from
0xE0 0xA0 0x80 through 0xEF 0xBF 0xBF, etc. [hoping I got those bytes
right] instead of simply "."?

> ... Sometimes a byte-based regex is also useful. For
> example my procmail rules reject mail containing any 8bit octets if
> there's not an appropriate mime type for it.
> This kills a lot of east asian spam. :)

Yep. Of course, you can still do that with character-based strings if you
can use other encodings. (E.g., in Java, you can read the mail as
ISO-8859-1, which maps bytes 0-255 to Unicode characters 0-255. Then you
can write the regular expression in terms of Unicode characters 0-255. The
only disadvantage there is probably some time spent decoding the byte
stream into the internal representation of characters.)

Maybe the net result of your point is that one should be able to read byte
streams in encodings other than just UTF-8. (A language might do that by
converting anything else into UTF-8, or it could use a different internal
representation, as Java uses UTF-16.)

> A stream should accept bytes, and a character string
> should always be interpreted as bytes according to the machine's
> locale when read/written to a stream

Note that it's specific to the stream, not the machine. (Consider, for
example, the charset parameter of HTTP's Content-Type header. A web browser
needs to handle different character encodings in different responses. A
MIME application needs to handle different character encodings in different
parts of a single multi-part message.)

Daniel
--
Daniel Barclay
[EMAIL PROTECTED]

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
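
A minimal sketch of the two Java points made in the reply above, offered as
an illustration rather than a recommended implementation. The class and
method names (ByteVsCharMatching, containsEightBitOctet, isSingleCharacter)
are hypothetical. The first method decodes raw bytes as ISO-8859-1 so that
a character-based regex can test for 8-bit octets; the second decodes
strictly as UTF-8 so that "." matches one whole character, however many
bytes encode it.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative only.
class ByteVsCharMatching {
    // After ISO-8859-1 decoding, byte value N becomes character U+00NN,
    // so this class matches exactly the bytes with the high bit set.
    private static final Pattern EIGHT_BIT = Pattern.compile("[\\x80-\\xFF]");

    // "." matches one code point; DOTALL lets it match line terminators too.
    private static final Pattern ONE_CHAR = Pattern.compile(".", Pattern.DOTALL);

    // True if the raw bytes contain any octet in the range 0x80..0xFF.
    static boolean containsEightBitOctet(byte[] raw) {
        String asLatin1 = new String(raw, StandardCharsets.ISO_8859_1);
        return EIGHT_BIT.matcher(asLatin1).find();
    }

    // True if the bytes are valid UTF-8 encoding exactly one character.
    static boolean isSingleCharacter(byte[] utf8) {
        try {
            // A fresh decoder reports malformed input instead of
            // silently substituting U+FFFD.
            String decoded = StandardCharsets.UTF_8.newDecoder()
                    .decode(ByteBuffer.wrap(utf8))
                    .toString();
            return ONE_CHAR.matcher(decoded).matches();
        } catch (CharacterCodingException e) {
            return false; // invalid or truncated UTF-8
        }
    }
}
```

The strict decoder is used for the UTF-8 case because the convenience
constructor new String(bytes, UTF_8) replaces malformed sequences rather
than rejecting them, which would hide exactly the invalid input discussed
in the thread.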
