On Tue, Mar 27, 2007 at 11:16:58AM -0400, SrinTuar wrote:
> >That would be contradictory to the whole concept of Unicode. A
> >human-readable string should never be considered an array of bytes, it is
> >an array of characters!
> 
> Hrm, that statement I think I would object to. For the overwhelming
> vast majority of programs, strings are simply arrays of bytes.
> (regardless of encoding)

In order to write applications that correctly handle accented letters,
Unicode taught us that we clearly have to distinguish between bytes and
characters, and that when handling text we have to think in terms of
characters. These characters are eventually stored in memory or on disk as
one or more bytes each, though. But in most cases you have to _think_ in
characters, otherwise it's quite unlikely that your application will work
correctly.
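
To make the distinction concrete, here is a minimal sketch. It is in Python
rather than perl, purely as an illustration, and the sample word and variable
names are my own assumptions. It shows that the byte count and the character
count of the same text differ, and that cutting at a byte offset can split a
character in two.

```python
# Illustrative sample text: "naive" with a precomposed i-diaeresis (U+00EF).
text = "na\u00efve"
encoded = text.encode("utf-8")

char_count = len(text)      # counts characters (code points)
byte_count = len(encoded)   # counts bytes of the UTF-8 encoding

# Truncating at a byte offset may cut a multi-byte character in half.
truncated = encoded[:3]
try:
    truncated.decode("utf-8")
    truncation_is_valid = True
except UnicodeDecodeError:
    truncation_is_valid = False

print(char_count, byte_count, truncation_is_valid)
```

The same cut made on the character string (text[:3]) cannot produce broken
output, which is the practical reason to think in characters.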

> The only time source code needs to care about
> characters is when it has to layout or format them for display.

No, there are many more situations. Even if your job is so simple that you
only have to convert a text to uppercase, you already have to know what
encoding (and actually what locale) is being used. Finding a particular
letter (especially in case-insensitive mode), performing regexp matching,
alphabetical sorting etc. are just a few trivial examples where you must
think in characters.
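
The uppercase case can be sketched as follows, again in Python and only as an
illustration with a made-up sample word. Uppercasing the raw UTF-8 bytes
touches only the ASCII letters, while uppercasing the character string also
converts the accented ones. Note that Python's str.upper() ignores the locale,
so the locale-dependent part of the problem is not shown here.

```python
# Illustrative sample word: "ete" with e-acute (U+00E9) at both ends.
word = "\u00e9t\u00e9"
raw = word.encode("utf-8")

# bytes.upper() only knows ASCII, so the accented letters are left alone.
upper_from_bytes = raw.upper().decode("utf-8")

# str.upper() works on characters and converts the accented letters too.
upper_from_chars = word.upper()

# Case-insensitive comparison also needs characters, not bytes.
same_ignoring_case = upper_from_chars.casefold() == word.casefold()

print(upper_from_bytes, upper_from_chars, same_ignoring_case)
```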

> If perl did not have a "utf-8" bit on its scalars, it would probably
> handle utf-8 alot better and more naturally, imo.

Probably. Probably not. I'm really unable to compare an existing programming
language with a hypothetical one. For example in PHP a string is simply a
sequence of bytes, and you have mb...() functions that handle them according
to the selected encoding. I don't think it's either better or worse than
perl, it's just a different approach.


> When I write a basic little perl script that reads in lines from a
> file, does trivial string operations on them, then prints them back
> out, there should be absolutely no need for my code to make any
> special considerations for encoding.

If none of these trivial string operations depend on the encoding then you
don't have to use this feature of perl, that's all. Simply make sure that
neither the file handles nor the strings that you concatenate or match
against are marked as utf8, so that you stay in the world of pure bytes.
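
The same "pure bytes" approach, sketched in Python as an analogue and not as
perl code: everything stays a byte string from input to output, so no decoding
ever happens. The function name tag_lines and the in-memory streams are
assumptions for the example; with real files you would open them in binary
mode.

```python
import io

def tag_lines(src, dst, suffix=b" <eol>"):
    """Copy lines from src to dst as bytes, appending a byte suffix.

    Hypothetical helper: nothing is decoded, so any encoding
    (or even invalid input) passes through untouched.
    """
    for line in src:
        dst.write(line.rstrip(b"\n") + suffix + b"\n")

# In-memory stand-ins for files opened in binary mode ("rb" / "wb").
source = io.BytesIO("caf\u00e9\n".encode("utf-8") + b"\xff\xfe\n")
target = io.BytesIO()
tag_lines(source, target)
result = target.getvalue()
print(result)
```

This works only because concatenating and stripping a newline do not depend
on where the character boundaries are.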



-- 
Egmont

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
