And after reading much of the earlier discussion, I must say that,
while I love UTF-8 dearly, it's usually the wrong abstraction level
to be working at for most text-processing jobs.  Ordinary people
not steeped in systems programming and C culture just want to think
in graphemes, and so that's what the standard dialect of Perl 6 will
default to.  A small nudge will push it into supporting the graphemes
of a particular human language.  The people who want to think more
like a computer will also find ways to do that.  It's just not the
sweet spot we're aiming for.

Very interesting, though I must admit I'm sad to hear that. Over the years
I have come to see the sweet spot for string processing as this
definition of a UTF-8 "string":

   "a null terminated series of bytes, some of which are parts of
valid utf-8 sequences, and others which are treated as individial
binary values"

Effectively, it's a bare-minimum update of what we had with ASCII. The
only time I want to depart from this paradigm is when I have to; in
general, I want to avoid conversion and keep my strings in this format
as much as possible. (This is much like the way tools such as readline,
vim, and curses handle UTF-8 strings.)
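To make that concrete, here is a rough C sketch of what I mean (my own
names; the check is deliberately loose and does not reject every overlong
or surrogate form): walk the string one unit at a time, taking a whole
well-formed UTF-8 sequence where there is one and otherwise taking a
single byte as an opaque binary value.

    #include <stddef.h>
    #include <stdio.h>

    /* Length of the well-formed UTF-8 sequence starting at p, or 0 if p
     * points at the terminating NUL or at a byte that does not start one.
     * (Deliberately loose: overlong and surrogate forms are not all
     * rejected; the point is only "valid sequence vs. stray byte".) */
    size_t utf8_seq_len(const unsigned char *p)
    {
        size_t len, i;

        if (p[0] == 0)
            return 0;
        if (p[0] < 0x80)
            return 1;                            /* ASCII */
        if      ((p[0] & 0xE0) == 0xC0) len = 2;
        else if ((p[0] & 0xF0) == 0xE0) len = 3;
        else if ((p[0] & 0xF8) == 0xF0) len = 4;
        else
            return 0;                            /* stray or continuation byte */

        for (i = 1; i < len; i++)
            if ((p[i] & 0xC0) != 0x80)
                return 0;                        /* truncated sequence */
        return len;
    }

    int main(void)
    {
        /* "é" in UTF-8, then a lone 0xFF byte, then plain ASCII. */
        const unsigned char s[] = "caf\xc3\xa9 \xff ok";
        const unsigned char *p;

        for (p = s; *p; ) {
            size_t n = utf8_seq_len(p);
            if (n > 1)       printf("UTF-8 sequence of %zu bytes\n", n);
            else if (n == 1) printf("ASCII byte '%c'\n", *p);
            else { printf("binary byte 0x%02X\n", (unsigned)*p); n = 1; }
            p += n;
        }
        return 0;
    }

Everything above this layer can simply pass the bytes along untouched.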

Most code should not have to care which parts are valid and which are
not. If it calls a function that requires a specific level of
validation, that function should be free to complain when the
requirement is not met. But I don't see a reason why a plain "print"
should ever need to care or complain about what it's printing. All it
has to do is concatenate the bytes and dump them out; it shouldn't act
as a bouncer deciding what is kosher to print.
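In C terms, that kind of "print" needs nothing more than this (a
trivial sketch, nothing Perl-specific about it):

    #include <stdio.h>
    #include <string.h>

    /* A "print" that never inspects its input: it writes the bytes between
     * the start of the string and the terminating NUL, valid UTF-8 or not. */
    void print_raw(const char *s)
    {
        fwrite(s, 1, strlen(s), stdout);
    }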

I think a regex engine should, for example, match one binary byte
against "." the same way it would match a valid sequence of Unicode
characters and combining characters as a single grapheme. This is a
best effort to work with the string as provided, and someone who does
not want such behavior simply would not run regexes over such strings.
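Something along these lines, reusing the utf8_seq_len() helper from the
sketch above (and, as a deliberate simplification, recognizing only the
U+0300..U+036F combining-diacritics block instead of the full Unicode
combining-class tables):

    #include <stddef.h>

    size_t utf8_seq_len(const unsigned char *p);   /* helper from the earlier sketch */

    /* True if p starts the UTF-8 encoding of a combining mark in
     * U+0300..U+036F (encoded as CC 80 .. CD AF).  A real engine would
     * consult full Unicode combining-class data instead. */
    static int is_combining(const unsigned char *p)
    {
        return (p[0] == 0xCC && p[1] >= 0x80 && p[1] <= 0xBF) ||
               (p[0] == 0xCD && p[1] >= 0x80 && p[1] <= 0xAF);
    }

    /* How many bytes "." would consume at p under this model: one raw byte
     * if the input is not valid UTF-8 here, otherwise one sequence plus any
     * immediately following combining marks, approximating one grapheme.
     * Returns 0 only at the end of the string. */
    size_t match_dot(const unsigned char *p)
    {
        size_t n = utf8_seq_len(p);
        if (n == 0)
            return p[0] ? 1 : 0;                   /* stray binary byte matches as itself */

        while (utf8_seq_len(p + n) == 2 && is_combining(p + n))
            n += 2;
        return n;
    }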

When a program needs to take in data from various encodings, it should
be its job to convert that data into the locale's native encoding (by
reading MIME headers or whatever other mechanism applies). I don't
think a programming language should have built-ins that track the
status of a string, as that strikes me as an attempt to DWIM and not
DWIS.
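For example, with POSIX iconv(3) and nl_langinfo(3) the program itself
can do the conversion up front and deal in plain bytes from then on
(the ISO-8859-1 source encoding and the sample data below are just
stand-ins for whatever a MIME header announced):

    #include <stdio.h>
    #include <string.h>
    #include <iconv.h>
    #include <locale.h>
    #include <langinfo.h>

    /* Convert incoming data from a known source encoding (here assumed to
     * be ISO-8859-1) into the locale's native encoding; the result is just
     * bytes, with nothing attached to it recording what was done. */
    int main(void)
    {
        setlocale(LC_ALL, "");
        const char *native = nl_langinfo(CODESET);   /* e.g. "UTF-8" */

        char inbuf[] = "caf\xe9";                    /* "café" in ISO-8859-1 */
        char outbuf[64];
        char *in = inbuf, *out = outbuf;
        size_t inleft = strlen(inbuf), outleft = sizeof outbuf - 1;

        iconv_t cd = iconv_open(native, "ISO-8859-1");
        if (cd == (iconv_t)-1) { perror("iconv_open"); return 1; }

        if (iconv(cd, &in, &inleft, &out, &outleft) == (size_t)-1)
            perror("iconv");
        *out = '\0';
        iconv_close(cd);

        printf("%s\n", outbuf);                      /* plain bytes from here on */
        return 0;
    }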

Taking that trend to its logical conclusion, I would not want every
scalar value to track every possible kind of validation that has
happened to a string: "UTF-8, validated NFD, Turkish + Korean". If
someone wants to do language-specific case folding, they can either
default to the locale's language and encoding or specify the ones they
want to use. If someone wants to make sure their string is valid UTF-8
in NFKC, they can pass it to a validation routine such as
Unicode::Normalize::NFKC. But the input and output of that routine
should be a plain old scalar, with no special knowledge of what has
happened to it.

This minimal approach is much like what happens in C/C++, and I don't
see any reason why a scripting language should do more than it is
asked to and, in the process, potentially do the wrong thing despite
its best intentions. Admittedly, in Perl 5 these are trivial annoyances
with readily available workarounds. From your post I guess I can
assume that Perl 6 will be similar.


On a separate topic:
Java seems to have a much worse problem. Forcing conversion to UTF-16
loses information, since UTF-16 cannot represent all the possible
invalid UTF-8 sequences. That forces you either to treat your strings
as binary blobs and lose access to all the functions that operate on
strings, or to take a performance hit for a conversion that isn't
actually needed. (If the design goal of Java was to force UTF-16 on
the world, it is unlikely to succeed, as UTF-8 has largely usurped
its place.)
