On Thu, Mar 29, 2007 at 07:43:54PM +0200, Egmont Koblinger wrote:
> On Thu, Mar 29, 2007 at 01:05:57PM -0400, Rich Felker wrote:
>
> > Gtk+-2's approach is horribly incorrect and broken. By default it
> > writes UTF-8 filenames into the filesystem even if UTF-8 is not the
> > user's encoding.
>
> There's an environment variable that tells Gtk+-2 to use legacy
> encoding in filenames. Whether or not forcing UTF-8 on filenames is a
> good idea is really questionable, you're right.
Well, the real solution is forcing UTF-8 in filenames by forcing
everyone who wants to use multilingual text to switch to UTF-8 locales.

> But I'm not just talking about filenames; there are many more strings
> handled inside Glib/Gtk+: strings coming from gettext that will be
> displayed on the screen, error messages originating from libc's
> strerror, strings typed by the user into entry widgets, and so on.
> Gtk+-2 uses UTF-8 everywhere, and (except for the filenames) it's
> clearly a wise decision.

Not if it will also be reading/writing text to stdout or text-based
config files, etc.

> I think this is just plain wrong. Since when do you browse the net and
> read accented pages? Since when have you used a UTF-8 locale?

Using accented characters in your own language has always been possible
with legacy codepage locales, and is still possible with what I consider
the correct implementation. The only thing that's not possible in legacy
codepage locales is handling text from other languages that need
characters not present in your codepage.

> I used Linux with a Latin-2 locale starting in 1996. It was around
> 2003 that I began using UTF-8 sometimes, and it was last year that I
> finally managed to switch fully to UTF-8. There are still several
> applications that are a nightmare with UTF-8 (midnight commander, for
> example). A few years ago software was even much worse; many programs
> were not ready for UTF-8, and it would have been nearly impossible to
> switch.

But now we're living in 2007, not 2003 or 1996. Maybe your approaches
had some merit then, but that's no reason to continue to use them now.
At this point anyone who wants multilingual text support should be using
UTF-8 natively, and if they have a good reason they're not (e.g. a
particular piece of broken software), that software should be quickly
fixed.

> When did you switch to unicode? Probably a few years earlier than I
> did, but I bet you also had those old-fashioned 8-bit days...

I've always used UTF-8 since I started with Linux; until recently it was
just restricted to the first 128 characters of Unicode, though. :) I
never used 8bit codepages except to draw stuff on DOS waaaaay back.

> So, I have used Linux for 10 years with an 8-bit locale set up. Still
> I could visit French, Japanese, etc. pages and the letters appeared
> correctly.

UTF-8 has been around for almost 15 years now, longer than any real
character-aware 8bit locale support on Linux. It was a mistake that 8bit
locales were ever implemented on Linux. If things had been done right
from the beginning we wouldn't even be having this discussion.

I'm sure you did have legitimate reasons to use Latin-2 when you did,
namely broken software without proper support for UTF-8. Here's where we
have to agree to disagree, I think: you're in favor of workarounds which
get quick results while increasing the long-term maintenance cost and
hurting corner-case usability, while I'm in favor of omitting
functionality (even very desirable functionality) until someone does it
right, with the goal of increasing the incentive for someone to do it
right.

> Believe me, I would have switched to Windows or whatever if Linux
> browsers weren't able to perform this pretty simple job.

Your loss, not mine.

> It's not about workarounds or non-issues. If a remote server tells my
> browser to display a kanji then my browser _must_ display a kanji,
> even if

Nonsense. If you don't have kanji fonts installed then it can't display
kanji anyway.
Not having a compatible encoding is a comparable obstacle to not having
fonts. I see no reason that a system without support for _doing_
anything with Japanese text should be able to display it. What happens
if you copy and paste it from your browser into a terminal or text
editor? Even the Unicode standards talk about a "supported subset" and
give official blessing to displaying characters outside the supported
subset as a '?' or replacement glyph or whatever.

> > > Show me your code that you think "just works" and I'll show you
> > > where you're wrong. :-)
> >
> > Mutt is an excellent example.
>
> As you might see from the header of my messages, I'm using Mutt too.
> In this regard mutt is a nice piece of software that handles accented
> characters correctly (nearly) always. In order to do this, it has to
> be aware of the charset of messages (and their parts) and the charset
> of the terminal, and has to convert between them plenty of times. The
> fact that it does its job (mostly) correctly implies that the authors
> didn't just write "blindly copy the bytes from the message to the
> terminal" kinds of functions; they took charset issues into account
> and converted the strings whenever necessary. From a user's point of
> view, accent handling in Mutt "just works". This is because the
> developers took care of it.

Mutt "just works" in exactly the sense I described. I've RTFS'd mutt
and studied it a fair bit: it converts all data to your locale charset
(or an overridable configured charset, but that could cause problems).
This is absolutely necessary since it wants to be able to use external
editors and viewers which require data to be in the system's encoding,
use the system regex routines, etc. All of this is what makes mutt
light, clean, and a good citizen among other unix apps.

> If the developers had thought "copying those bytes from the mail to
> the terminal" would "_just work_" then mutt would be an unusable mess.

I have never suggested doing something idiotic like that, yet you keep
bringing it up again and again as if I did. "Just work" means using the
C/POSIX multibyte, iconv, regex, etc. functions the way they're
intended to be used, and treating all text as text (in the LC_CTYPE
sense) once it's been read in from local or foreign sources (with the
natural conversions required for the latter).
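To make that concrete, here is a minimal sketch of the kind of
conversion step this implies, using only the standard interfaces named
above (nl_langinfo, iconv). It is not code from mutt; to_locale_charset
is a hypothetical helper name, and the '?' fallback is the
replacement-character behavior discussed earlier:

/* Sketch only: convert 'in' from 'fromcode' to the locale's charset,
 * replacing anything unconvertible with '?'.  Caller frees the result. */
#include <errno.h>
#include <iconv.h>
#include <langinfo.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char *to_locale_charset(const char *in, const char *fromcode)
{
    iconv_t cd = iconv_open(nl_langinfo(CODESET), fromcode);
    if (cd == (iconv_t)-1) return NULL;

    size_t inleft = strlen(in);
    size_t outsize = 4*inleft + 4;   /* generous bound, fine for a sketch */
    size_t outleft = outsize - 1;    /* reserve room for the final NUL */
    char *out = malloc(outsize), *outp = out;
    char *inp = (char *)in;

    if (!out) { iconv_close(cd); return NULL; }

    while (inleft) {
        if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1)
            continue;
        if ((errno == EILSEQ || errno == EINVAL) && outleft) {
            /* Unconvertible or malformed input: emit the replacement
             * character the standards bless, skip a byte, and reset
             * the conversion state. */
            *outp++ = '?'; outleft--;
            inp++; inleft--;
            iconv(cd, NULL, NULL, NULL, NULL);
        } else {
            free(out); iconv_close(cd); return NULL;
        }
    }
    *outp = 0;
    iconv_close(cd);
    return out;
}

int main(void)
{
    setlocale(LC_CTYPE, "");  /* adopt the user's locale, UTF-8 or not */
    char *s = to_locale_charset("B\xe9zier curves", "ISO-8859-1");
    if (s) { puts(s); free(s); }
    return 0;
}

Run in a UTF-8 locale this prints the accented name; run in a plain
ASCII locale (under glibc at least) it prints "B?zier curves", which is
exactly the supported-subset behavior described above.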
Rich

--
Linux-UTF8:   i18n of Linux on all levels
Archive:      http://mail.nl.linux.org/linux-utf8/