On Thu, Mar 29, 2007 at 12:01:49AM -0400, Rich Felker wrote:
: On Mon, Mar 26, 2007 at 05:28:43PM -0400, SrinTuar wrote:
: > I frequently run into problems with utf-8 in perl, and I was wondering
: > if anyone else had encountered similar things.
: [...]
:
: Can we get back on-topic with this, and look for solutions to the
: problems? Maybe Larry has some thoughts for us?
Heh, well, the short answer is "Perl 6"... :)
My thoughts are that all languages screw this up in various ways
(including Perl 5), and that my fond hope is that Perl 6 will
allow people to program at the appropriate abstraction level for
each given task. Perl 6 is really designed to be many languages,
not one language, and this is considered to be wonderfulness as long
as the exact pedigree of your language is properly specified from a
universal root. Sort of a URL for languages. Every lexical scope
is written in some language or another. So your desired Unicode
abstraction level is declared lexically, but that really just sets
up the defaults for that dialect. I think no language can succeed
at modern string processing without a type system that knows what it
knows, and more importantly doesn't make assumptions about what it
doesn't know without the approval of the programmer.
Although you can choose to write in some particular dialect of Perl 6,
and dialectical differences are most naturally lexically scoped, the
typology must be dynamic to allow meaningful interchange of data.
Perl 6 has four main Unicode levels it deals with. You can use a
dialect that specializes in bytes, codepoints, language-independent
graphemes, or language-dependent graphemes for a known language.
But a mere dialect cannot and should not overrule the actual typology,
nor should any dialect impose its view on a different dialect.
This is why dialects are lexically scoped.
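
To make the levels concrete, here is a minimal sketch in Python (not Perl 6) of the same text viewed as bytes, as codepoints, and as language-independent graphemes. The graphemes() helper is a hypothetical simplification that only attaches combining marks to the preceding base character; it is not full Unicode segmentation, and the language-dependent level is not shown.

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme clusters: a base codepoint plus any
    combining marks that follow it (simplified, not full UAX #29)."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

# "resume" with the accents spelled as combining acute marks (U+0301)
word = "re\u0301sume\u0301"

as_bytes = word.encode("utf-8")     # byte level
as_codepoints = list(word)          # codepoint level
as_graphemes = graphemes(word)      # grapheme level (approximate)

print(len(as_bytes), len(as_codepoints), len(as_graphemes))  # 10 8 6
```

The three counts differ for the same six visible characters, which is why the level has to be chosen rather than assumed.
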
The actual strings must know whether they are text or binary, and
each string must know which levels of API it is willing to work with.
A binary buffer has some aspects of stringiness but is treated more
like C strings insofar as a buffer type is really just an array of
identical elements. Perl 6 lets you access a byte buffer either as an array
or as a rudimentary string as long as you don't ask it to assume any
semantics it doesn't know typologically. Basically, ASCII is about all
you can assume otherwise.
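
As an analogy only, Python's bytes type behaves in roughly this way, and the sketch below uses it to show the two views of a buffer. Nothing here is Perl 6 API; the buffer contents are arbitrary made-up data.

```python
# Arbitrary binary data: three ASCII letters, CR LF, then two non-ASCII bytes.
buf = bytes([0x50, 0x4E, 0x47, 0x0D, 0x0A, 0xFF, 0xFE])

# Array view: every element is just a small integer.
first = buf[0]
total = sum(buf)

# Rudimentary string view: only ASCII-level operations are meaningful.
starts_png = buf.startswith(b"PNG")
lowered = buf.lower()   # changes ASCII letters only, other bytes untouched

# Mixing the buffer with text, without naming an encoding, is refused.
try:
    "prefix: " + buf
    mixed = True
except TypeError:
    mixed = False

# Guessing an encoding the data does not satisfy fails loudly.
try:
    buf.decode("utf-8")
    decoded = True
except UnicodeDecodeError:
    decoded = False

print(first, starts_png, lowered, mixed, decoded)
```
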
A text string may have a minimum and maximum abstraction level.
Your string type may well be aware that you have a bunch of Swahili
encoded in UTF-8, so it could choose to let you deal with that string
on the byte level, the codepoint level, the grapheme level, or the
Swahili level. Or it might not choose to allow all those levels. Some
strings might only provide a codepoint or grapheme API for instance.
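
One way to picture a string with a minimum and maximum level is the hypothetical LeveledText class below, written in Python. The class name, its parameters, and the three-level list are all assumptions made for illustration, and the grapheme logic is a simplification that only handles combining marks.

```python
import unicodedata

LEVELS = ("bytes", "codepoints", "graphemes")   # lowest to highest

class LeveledText:
    """Hypothetical text type exposing only the levels it agrees to."""

    def __init__(self, text, lowest="codepoints", highest="graphemes",
                 encoding="utf-8"):
        self._text = text
        self._encoding = encoding
        lo, hi = LEVELS.index(lowest), LEVELS.index(highest)
        self._allowed = LEVELS[lo:hi + 1]

    def units(self, level):
        """Return the string as a list of units at the requested level."""
        if level not in self._allowed:
            raise TypeError(f"this string does not offer a {level} API")
        if level == "bytes":
            return list(self._text.encode(self._encoding))
        if level == "codepoints":
            return list(self._text)
        clusters = []                       # graphemes (simplified)
        for ch in self._text:
            if clusters and unicodedata.combining(ch):
                clusters[-1] += ch
            else:
                clusters.append(ch)
        return clusters

# A string that offers codepoints and graphemes, but hides its bytes.
s = LeveledText("cafe\u0301", lowest="codepoints", highest="graphemes")
print(len(s.units("codepoints")))   # 5
print(len(s.units("graphemes")))    # 4
try:
    s.units("bytes")
except TypeError as err:
    print(err)
```
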
Perl 6 allows the lower level encoding to be encapsulated, in other
words. You don't have to care whether something is represented in
UTF-whatever as long as the Unicode abstraction is correct at the
level you want to deal with it. The underlying encoding might not
even be a "UTF". All Perl 6 requires is that the semantics be Unicode
semantics. How that gets mapped to the underlying representation
doesn't need to concern the programmer unless they want it to.
Mostly the programmer just needs to make sure all the portals between
the outside world and the program are typologically safe.
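
A minimal Python sketch of such portals follows. The read_text and write_text helpers are hypothetical names, and in-memory streams stand in for files or sockets. The same text arrives in three encodings, is decoded strictly at the boundary, and is indistinguishable once inside the program.

```python
import io

def read_text(stream, encoding):
    """Input portal: bytes come in, text goes out, encoding stated explicitly."""
    return stream.read().decode(encoding, errors="strict")

def write_text(stream, text, encoding):
    """Output portal: text is encoded only where it leaves the program."""
    stream.write(text.encode(encoding, errors="strict"))

# The same word arriving through three differently encoded sources.
word = "na\u00efve"
sources = {
    "utf-8": io.BytesIO(word.encode("utf-8")),
    "utf-16": io.BytesIO(word.encode("utf-16")),
    "latin-1": io.BytesIO(word.encode("latin-1")),
}
texts = [read_text(stream, enc) for enc, stream in sources.items()]

# Inside the program the underlying representation no longer matters.
all_equal = all(t == texts[0] for t in texts)
print(all_equal, len(texts[0]))     # True 5

out = io.BytesIO()
write_text(out, texts[0], "utf-8")
print(out.getvalue())               # b'na\xc3\xafve'
```
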
I'm sure I've left out a few important details, but that's the gist of
it. For more on the design and development of Perl 6, some good places
to start are:
http://perlcabal.org/syn
http://www.pugscode.org/
http://www.parrotcode.org/
(If you start playing with the pugs prototype, note that it's
still somewhat hardwired to support just the codepoints dialect.
We're currently concentrating on the meta-object support, and better
Unicode support will follow from that.)
Perl 6 itself is wholeheartedly a Unicode language in the abstract;
one of the big benefits is that there is almost no pressure to overload
existing operators with completely unrelated meanings. Compare with
C++'s (mis)treatment of <<, say. If something looks strange in Perl 6,
it probably *is* strange. And if you program in the APL subset of
Perl 6, expect to get a few strange looks yourself. :-)
And after reading much of the earlier discussion, I must say that,
while I love UTF-8 dearly, it's usually the wrong abstraction level
to be working at for most text-processing jobs. Ordinary people
not steeped in systems programming and C culture just want to think
in graphemes, and so that's what the standard dialect of Perl 6 will
default to. A small nudge will push it into supporting the graphemes
of a particular human language. The people who want to think more
like a computer will also find ways to do that. It's just not the
sweet spot we're aiming for.
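
A small Python sketch of why the lower levels are awkward for ordinary text jobs: truncating at the byte level can cut a character in half, and reversing at the codepoint level detaches an accent from its letter. The graphemes() helper is a hypothetical simplification that only handles combining marks, not full Unicode segmentation.

```python
import unicodedata

def graphemes(text):
    """Approximate grapheme clusters: base codepoint plus following marks."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

name = "Zoe\u0308"   # "Zoe" with a combining diaeresis on the final letter

# Byte level: keeping the first 4 bytes splits the 2-byte combining mark.
cut = name.encode("utf-8")[:4]
try:
    cut.decode("utf-8")
    cut_is_valid = True
except UnicodeDecodeError:
    cut_is_valid = False

# Codepoint level: the mark ends up first, with no letter to sit on.
by_codepoint = name[::-1]

# Grapheme level: the letter and its mark travel together.
by_grapheme = "".join(reversed(graphemes(name)))

print(cut_is_valid)             # False
print(ascii(by_codepoint))      # '\u0308eoZ'
print(ascii(by_grapheme))       # 'e\u0308oZ'
```
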
Larry
--
Linux-UTF8: i18n of Linux on all levels
Archive: http://mail.nl.linux.org/linux-utf8/