On Tue, Mar 27, 2007 at 06:31:11PM +0200, Egmont Koblinger wrote:
> On Tue, Mar 27, 2007 at 11:16:58AM -0400, SrinTuar wrote:
> > >That would be contradictory to the whole concept of Unicode. A
> > >human-readable string should never be considered an array of bytes, it
> > >is an array of characters!
> > 
> > Hrm, I think I would object to that statement. For the overwhelming
> > majority of programs, strings are simply arrays of bytes.
> > (regardless of encoding)
> 
> In order to be able to write applications that correctly handle accented
> letters, Unicode taught us that we clearly have to distinguish between bytes
> and characters,

No, accented characters have nothing to do with the byte/character
distinction. That applies to any non-ASCII character. However, it only
matters when you're performing display, editing, or pattern-based
(as opposed to literal-string) searching.
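
For what it's worth, the distinction only becomes visible once you ask a
character-level question; a quick Perl sketch (assuming the data really is
UTF-8 bytes):

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = "na\xc3\xafve";            # raw UTF-8 bytes for "naïve"
    my $chars = decode('UTF-8', $bytes);   # character string

    print length($bytes), "\n";            # 6 -- counts bytes
    print length($chars), "\n";            # 5 -- counts characters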

Accents and combining marks have to do with the character/grapheme
distinction, which is pretty much relevant only for display.

None of this is relevant to most processing of text, which is just
storage, retrieval, concatenation, and exact substring search.
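
For example, exact substring search is the same byte-level operation
whether or not the data happens to be UTF-8; a sketch, assuming both
strings are UTF-8 byte strings from the same system:

    # UTF-8's design guarantees the encoding of "Köln" can only occur in
    # the byte stream where the character sequence occurs, so a plain
    # byte-level index() is already correct.
    my $haystack = "50937 K\xc3\xb6ln, Germany";
    my $needle   = "K\xc3\xb6ln";
    print "found\n" if index($haystack, $needle) >= 0;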

> and when handling texts we have to think in terms of
> characters. These characters are eventually stored in memory or on disk as
> several bytes, though. But in most of the cases you have to _think_ in
> characters, otherwise it's quite unlikely that your application will work
> correctly.

It's the other way around, too: you have to think in terms of bytes.
If you're thinking in terms of characters too much, you'll end up doing
noninvertible transformations and introducing vulnerabilities when data
has been maliciously crafted not to be valid utf-8 (or just bugs due
to normalizing data, etc.).
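
A sketch of what I mean, in Perl: if you have to decode at all, refuse
malformed input instead of quietly rewriting it; the quiet fixup is
exactly the noninvertible step:

    use strict;
    use warnings;
    use Encode qw(decode);

    my $bytes = do { local $/; <STDIN> };   # raw bytes, as received
    defined $bytes or exit 0;
    my $chars = eval { decode('UTF-8', $bytes, Encode::FB_CROAK) };
    defined $chars
        or die "malformed UTF-8 -- keeping my hands off it\n";
    # Substituting U+FFFD for the bad bytes here would silently destroy
    # the original data.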

> > The only time source code needs to care about
> > characters is when it has to layout or format them for display.
> 
> No, there are many more situations. Even if your job is so simple that you
> only have to convert a text to uppercase, you already have to know what
> encoding (and actually what locale) is being used.

This is not a simple task at all, and in fact it's a task that a
computer should (almost) never do... Case-insensitivity is bad enough,
but case conversion is a horrible horrible mistake. Create your data
in the case you want it in.

The whole idea of case conversion in programming languages is
disgustingly euro-centric. The rest of the world doesn't have such a
stupid thing as case...
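
To make it concrete (the Turkish "i" being the textbook case), a
locale-blind uppercase is wrong even for pure-ASCII input; a small Perl
illustration:

    # In Turkish the uppercase of "i" is the dotted capital I (U+0130),
    # not "I", so even this ASCII-only input comes out wrong.
    my $word = "istanbul";
    print uc($word), "\n";    # "ISTANBUL" -- incorrect for Turkish text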

> Finding a particular
> letter (especially in case insensitive mode),

Hardly. A byte-based regex for all case matches (e.g. "(ä|Ä)") will
work just as well even for case-insensitive matching, and literal
character matching is simple substring matching, just as in any other
sane encoding. I get the impression you don't understand UTF-8...
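
Concretely, assuming both this sketch and its input are saved as UTF-8,
the alternation is matched as plain bytes, with no decoding step anywhere:

    while (my $line = <STDIN>) {
        # The literals are the raw UTF-8 byte sequences as saved in the
        # source; matching them against UTF-8 input is byte comparison.
        print $line if $line =~ /(ä|Ä)/;
    }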

> performing regexp matching,
> alphabetical sorting etc. are just a few trivial examples where you must
> think in characters.

A character-based regex engine (which is what POSIX BRE/ERE specify)
needs to think in terms of characters. Sometimes a byte-based regex is
also useful. For example, my procmail rules reject mail containing any
8bit octets if
there's not an appropriate mime type for it. This kills a lot of east
asian spam. :)
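
(The recipes themselves aren't worth pasting; the policy, written out as
a rough Perl sketch rather than the actual procmail rules, is just:)

    #!/usr/bin/perl
    # Rough equivalent of the policy, NOT the real procmail recipe: flag
    # a message whose body contains 8-bit octets but whose headers never
    # declare a charset for them.
    use strict;
    use warnings;

    my $msg = do { local $/; <STDIN> };
    defined $msg or exit 0;
    my ($hdr, $body) = split /\n\n/, $msg, 2;
    $body = '' unless defined $body;
    exit 1 if $body =~ /[\x80-\xFF]/ && $hdr !~ /charset=/i;   # treat as spam
    exit 0;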

> > If perl did not have a "utf-8" bit on its scalars, it would probably
> > handle utf-8 a lot better and more naturally, imo.
> 
> Probably. Probably not. I'm really unable to compare an existing programming
> language with a hypothetical one. For example in PHP a string is simply a
> sequence of bytes, and you have mb...() functions that handle them according
> to the selected locale. I don't think it's either better or worse than perl,
> it's just a different approach.

Well it's definitely worse for someone who just wants text to work on
their system without thinking about encoding. And it WILL just work
(as evidenced by my disabling of the warning and still getting correct
behavior) as long as the whole system is consistent, regardless of
what encoding is used.

Yes, strings need to distinguish byte/character data. But streams
should not. A stream should accept bytes, and a character string
should always be converted to or from bytes according to the machine's
locale when it is written to or read from a stream, or when it is
incorporated into byte strings.
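
Roughly, as a Perl sketch of that model (assuming a Unix system where
langinfo(CODESET) reports the locale's charset):

    use strict;
    use warnings;
    use Encode qw(encode);
    use POSIX qw(setlocale LC_CTYPE);
    use I18N::Langinfo qw(langinfo CODESET);

    setlocale(LC_CTYPE, "");              # honour the user's locale settings
    my $codeset = langinfo(CODESET);      # e.g. "UTF-8" or "ISO-8859-1"

    my $text = "caf\x{e9}";               # character string containing U+00E9
    print encode($codeset, $text), "\n";  # becomes bytes only at the stream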

> > When I write a basic little perl script that reads in lines from a
> > file, does trivial string operations on them, then prints them back
> > out, there should be absolutely no need for my code to make any
> > special considerations for encoding.
> 
> If none of these trivial string operations depend on the encoding then you
> don't have to use this feature of perl, that's all. Simply make sure that
> the file descriptors are not set to utf8, and neither are the strings that
> you concat or match against, etc., so you stay in the world of pure bytes.

But it should work even with strings interpreted as characters!
There's no legitimate reason for it not to.

Moreover, the warning is fundamentally stupid because it does not
trigger for characters in the range 128-255, only for those above 255.
This is an implicit assumption that someone would want to use latin1,
which is simply backwards and wrong. A program printing characters in
latin1 without associating an encoding with the stream is just as
"wrong" as a program writing arbitrary unicode characters.
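
You can see it in a couple of lines (at least with the perl builds I've
tried):

    use warnings;
    print chr(0xE9);      # U+00E9: written silently as one latin1 byte
    print chr(0x263A);    # U+263A: triggers "Wide character in print"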

Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/
