On Thu, Mar 29, 2007 at 07:15:37PM +0200, Egmont Koblinger wrote:
> > or failing that ask the programmer to explicitly qualify them as one of
> > its supported encodings. I do not think the strings should have built in
> > machinery that does this work behind the scenes implicitly.
>
> If you have the freedom of choosing the character set you use, you need to

You don't. An application should assume that there is no such freedom; the
character encoding is dictated by the user or the host implementation, and
should on all modern systems be UTF-8 (but don't assume this). Any text
that's encoded with another scheme needs to be treated as non-text (binary)
data (i.e. not suitable for use with regex). It could be converted (to the
dictated encoding) or left as binary data, depending on the application.

> tell the regexp matching function what charset you use. (It's a reasonable
> decision that the default is the charset of the current locale, but it has
> to be overridable.) There are basically two ways I think to reach this goal.

You can get by just fine without it being overridable. For instance, mutt
does just fine using the POSIX regex routines, which do not have any way of
specifying a character encoding.
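
For illustration only (a sketch, not mutt's actual code, and assuming a
multibyte-aware POSIX regex implementation such as glibc's): regcomp() and
regexec() take no charset argument at all; they simply interpret both the
pattern and the text according to the current LC_CTYPE locale.

    #include <locale.h>
    #include <regex.h>
    #include <stdio.h>

    int main(void)
    {
        /* "AÁB" encoded in UTF-8: bytes 65 195 129 66. */
        const char *text = "A\xC3\x81" "B";
        regex_t re;

        /* Adopt whatever encoding the user's environment dictates. */
        setlocale(LC_ALL, "");

        if (regcomp(&re, "A.B", REG_EXTENDED) != 0)
            return 1;

        /* Under a UTF-8 locale '.' matches the single character Á, so
         * this prints Hooray; under a single-byte locale the same four
         * bytes do not match A.B at all. */
        if (regexec(&re, text, 0, NULL, 0) == 0)
            printf("Hooray\n");

        regfree(&re);
        return 0;
    }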

> 1st: strings are just byte sequences, and you may pass the charset
> information as external data.
>
> 2nd: strings are either forced to a fixed encoding (UTF-8 in Gtk+, UCS-16 in
> Java) or carry meta-information about their encoding (utf8 flag in Perl).

Of these (neither of which is necessary), #1 is the more unix-like and #2
is the mac/windows approach. Unix has a strong history of intentionally
NOT assigning types to data files etc., but instead treating everything as
streams of bytes. This leads to very powerful combinations of tools where
the same data (byte sequence) is interpreted in different ways by different
tools/contexts. I am a mathematician, and I must say it's comparable to
what we do when we allow ourselves to think of an operator on a linear
space both as a map between linear spaces and as an element of a larger
linear space of operators (and possibly also in many other ways) at the
same time.

On the other hand, DOS/Windows/Mac have a strong history of assigning
fixed types to data files. On DOS/Windows it's mostly just extensions, but
Mac goes much farther with the 'resource fork', not only typing the file
but also associating it with a creating application. This sort of
mechanism is, in my opinion, deceptively convenient to ignorant new users,
but it also fosters an unsophisticated, uneducated, less powerful way of
thinking about data.

Of course in either case there are ways to override things and get around
the limitations. Even on unix, files tend to have suffixes to identify the
'type' a user will most often want to consider the file as, and likewise
on Mac you can edit the resource forks or ignore them. Still, I think the
approach you take says a lot about your system philosophy.

> Using the 1st approach I still can't see how you'd imagine Perl to work.
> Let's go back to my earlier example. Suppose perl reads a file's content
> into a variable. This file contained 4 bytes, namely: 65 195 129 66. Then
> you do this:
>
> print "Hooray\n" if $filecontents =~ m/A.B/;
>
> Should it print Hooray or not if you run this program under a UTF-8 locale?

Of course.

> On one hand, when running with a Latin1 locale it didn't print it. So it
> mustn't print Hooray otherwise you break backwards compatibility.

No, the program still does the same thing if run in a Latin-1 locale,
regardless of your perl version. There's no reason to believe that text
processing code should behave byte-identically under different locales.

> On the other hand, we just encoded the string "AÁB" in UTF-8 since nowadays
> we use UTF-8 everywhere, and of course everyone expects AÁB to match A.B.

So you need to make your data and your locale consistent. If you want to
set the locale to UTF-8, the string "AÁB" needs to be in UTF-8. If you
want to use the legacy Latin-1 data, your locale needs to be set to
something Latin-1-based.
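
For illustration only (again just a sketch, assuming a glibc-style
iconv(3) and ISO-8859-1 as the legacy encoding): converting the legacy
bytes to the encoding the locale dictates, before any matching is done,
might look like this.

    #include <iconv.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        /* "AÁB" as legacy Latin-1 data: bytes 65 193 66. */
        char latin1[] = "A\xC1" "B";
        char utf8[16];
        char *in = latin1, *out = utf8;
        size_t inleft = strlen(latin1), outleft = sizeof utf8 - 1;
        iconv_t cd;

        cd = iconv_open("UTF-8", "ISO-8859-1");
        if (cd == (iconv_t)-1)
            return 1;

        /* Convert the data to the encoding the environment dictates. */
        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            return 1;
        *out = '\0';
        iconv_close(cd);

        /* utf8 now holds bytes 65 195 129 66, the form that A.B (or
         * Perl's m/A.B/) will match under a UTF-8 locale. */
        printf("%s\n", utf8);
        return 0;
    }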

> How would you design Perl's Unicode support to overcome this contradiction?

I don't see any contradiction here. The code does exactly what it's
supposed to in either case, as long as your locale and data are
consistent.

Rich

--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/
