On Tue, Mar 27, 2007 at 11:16:58AM -0400, SrinTuar wrote:

> > That would be contradictory to the whole concept of Unicode. A
> > human-readable string should never be considered an array of bytes,
> > it is an array of characters!
>
> Hrm, that statement I think I would object to. For the overwhelming
> vast majority of programs, strings are simply arrays of bytes.
> (regardless of encoding)
In order to write applications that correctly handle accented letters,
Unicode taught us that we clearly have to distinguish between bytes and
characters, and that when handling text we have to think in terms of
characters. These characters are eventually stored in memory or on disk
as one or more bytes each, though. But in most cases you have to
_think_ in characters, otherwise it's quite unlikely that your
application will work correctly.

> The only time source code needs to care about
> characters is when it has to layout or format them for display.

No, there are many more situations. Even if your job is so simple that
you only have to convert a text to uppercase, you already have to know
what encoding (and actually what locale) is being used. Finding a
particular letter (especially in case-insensitive mode), performing
regexp matching, alphabetical sorting etc. are just a few trivial
examples where you must think in characters.

> If perl did not have a "utf-8" bit on its scalars, it would probably
> handle utf-8 alot better and more naturally, imo.

Probably. Probably not. I'm really unable to compare an existing
programming language with a hypothetical one. For example, in PHP a
string is simply a sequence of bytes, and you have mb...() functions
that handle them according to the selected encoding. I don't think it's
either better or worse than perl, it's just a different approach.

> When I write a basic little perl script that reads in lines from a
> file, does trivial string operations on them, then prints them back
> out, there should be absolutely no need for my code to make any
> special considerations for encoding.

If none of these trivial string operations depend on the encoding, then
you don't have to use this feature of perl, that's all. Simply make sure
that neither the file descriptors nor the strings you concatenate or
match against are set to utf8, so you stay in the world of pure bytes.
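To make the uppercase example above concrete, here is a minimal sketch.
It is written in Python rather than perl or PHP, purely for
illustration; the sample word and all variable and function names are
my own assumptions, not anything from this thread. It treats the same
text once as characters and once as UTF-8 bytes:

```python
# Illustrative sample text with accented letters (an assumption, chosen
# only because every second letter is non-ASCII).
text = "árvíztűrő"
raw = text.encode("utf-8")   # the same text as a sequence of bytes

char_count = len(text)       # counts characters (code points)
byte_count = len(raw)        # counts bytes; accented letters take two each

# Uppercasing characters handles the accented letters.
upper_chars = text.upper()

# bytes.upper() only maps ASCII a-z, so the accented letters stay lowercase.
upper_bytes = raw.upper().decode("utf-8")


def first_byte_decodes(data: bytes) -> bool:
    """Return True if the first byte on its own is valid UTF-8."""
    try:
        data[:1].decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(char_count, byte_count)
print(upper_chars, upper_bytes)
print(first_byte_decodes(raw))   # cutting at a byte offset splits the first letter
```

The byte-level version is not wrong as long as you only copy or
concatenate the data. It goes wrong exactly where the operation depends
on what a letter is: case conversion, cutting at a position, searching.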
--
Egmont

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
