On Tue, Mar 27, 2007 at 06:31:11PM +0200, Egmont Koblinger wrote:
> On Tue, Mar 27, 2007 at 11:16:58AM -0400, SrinTuar wrote:
> > >That would be contradictory to the whole concept of Unicode. A
> > >human-readable string should never be considered an array of bytes,
> > >it is an array of characters!
> >
> > Hrm, that statement I think I would object to. For the overwhelming
> > vast majority of programs, strings are simply arrays of bytes.
> > (regardless of encoding)
>
> In order to be able to write applications that correctly handle accented
> letters, Unicode taught us that we clearly have to distinguish between
> bytes and characters,

No, accented characters have nothing to do with the byte/character
distinction. That applies to any non-ASCII character. However, it only
matters when you'll be performing display, editing, and pattern-based
(though not literal-string-based) searching. Accents and combining marks
have to do with the character/grapheme distinction, which is pretty much
relevant only for display. None of this is relevant to most processing of
text, which is just storage, retrieval, concatenation, and exact
substring search.
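To make that last point concrete, a quick sketch (assuming plain perl,
no utf8 layers anywhere, and data that is already UTF-8): a byte-wise
index() finds a multibyte needle at exactly the right place, because
UTF-8 guarantees that a valid character sequence can never match
starting in the middle of another character.

    use strict;
    use warnings;

    # Raw UTF-8 bytes, never decoded into "characters".
    my $haystack = "sz\xc3\xa9p j\xc3\xb3 nap";   # the bytes of "szép jó nap"
    my $needle   = "j\xc3\xb3";                   # the bytes of "jó"

    # Plain byte-oriented substring search; no encoding awareness needed.
    my $pos = index($haystack, $needle);
    print "found at byte offset $pos\n" if $pos >= 0;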
> and when handling texts we have to think in terms of
> characters. These characters are eventually stored in memory or on disk as
> several bytes, though. But in most of the cases you have to _think_ in
> characters, otherwise it's quite unlikely that your application will work
> correctly.

It's the other way around, too: you have to think in terms of bytes. If
you're thinking in terms of characters too much you'll end up doing
noninvertible transformations and introducing vulnerabilities when data
has been maliciously crafted not to be valid UTF-8 (or just plain bugs
due to normalizing data, etc.).

> > The only time source code needs to care about
> > characters is when it has to layout or format them for display.
>
> No, there are many more situations. Even if your job is so simple that you
> only have to convert a text to uppercase, you already have to know what
> encoding (and actually what locale) is being used.

This is not a simple task at all, and in fact it's a task that a computer
should (almost) never do... Case insensitivity is bad enough, but case
conversion is a horrible, horrible mistake. Create your data in the case
you want it in. The whole idea of case conversion in programming
languages is disgustingly Euro-centric. The rest of the world doesn't
have such a stupid thing as case...

> Finding a particular
> letter (especially in case-insensitive mode),

Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will work
just as well even for case-insensitive matching, and literal character
matching is simple substring matching, identical to any other sane
encoding. I get the impression you don't understand UTF-8...
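For the record, here is the "(ä|Ä)" trick as a quick sketch (it assumes
the input really is UTF-8 bytes; no /i, no locale, no utf8 flag
involved):

    use strict;
    use warnings;

    my $lower = "\xc3\xa4";              # the bytes of "ä"
    my $upper = "\xc3\x84";              # the bytes of "Ä"
    my $re    = qr/(?:$lower|$upper)/;   # byte-level (ä|Ä), no /i needed

    # The regex engine never needs to know these bytes are UTF-8.
    for my $line ("B\xc3\x84R", "b\xc3\xa4r", "bar") {
        printf "%s: %s\n", $line, ($line =~ $re ? "match" : "no match");
    }

This obviously doesn't scale to full Unicode case folding, but as argued
above, that's something you should hardly ever be doing anyway.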
> performing regexp matching,
> alphabetical sorting etc. are just a few trivial examples where you must
> think in characters.

Character-based regex (which POSIX BRE/ERE is) needs to think in terms of
characters. Sometimes a byte-based regex is also useful. For example, my
procmail rules reject mail containing any 8-bit octets if there's not an
appropriate MIME type for it. This kills a lot of East Asian spam. :)

> > If perl did not have a "utf-8" bit on its scalars, it would probably
> > handle utf-8 a lot better and more naturally, imo.
>
> Probably. Probably not. I'm really unable to compare an existing programming
> language with a hypothetical one. For example in PHP a string is simply a
> sequence of bytes, and you have mb...() functions that handle them according
> to the selected locale. I don't think it's either better or worse than perl,
> it's just a different approach.

Well, it's definitely worse for someone who just wants text to work on
their system without thinking about encoding. And it WILL just work (as
evidenced by my disabling of the warning and still getting correct
behavior) as long as the whole system is consistent, regardless of what
encoding is used.

Yes, strings need to distinguish byte/character data. But streams should
not. A stream should accept bytes, and a character string should always
be interpreted as bytes according to the machine's locale when read from
or written to a stream, or when incorporated into byte strings.

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
>
> If none of these trivial string operations depend on the encoding then you
> don't have to use this feature of perl, that's all. Simply make sure that
> the file descriptors are not set to utf8, neither are the strings that you
> concat or match, etc., so you stay in the world of pure bytes.

But it should work even with strings interpreted as characters! There's
no legitimate reason for it not to.

Moreover, the warning is fundamentally stupid because it does not trigger
for characters in the range 128-255, only >255. This is an implicit
assumption that someone would want to use latin1, which is simply
backwards and wrong. A program printing characters in latin1 without
associating an encoding with the stream is just as "wrong" as a program
writing arbitrary Unicode characters.
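To see the asymmetry, a minimal demo with stock perl and nothing special
done to STDOUT:

    use warnings;

    print chr(0xE4), "\n";    # U+00E4: no warning, silently written as the byte 0xE4
    print chr(0x100), "\n";   # U+0100: "Wide character in print" warning, comes out as UTF-8

On a UTF-8 terminal it's the silent one that comes out as garbage, which
is exactly backwards.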
Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/