Unicode thoughts...

Jeff Sun, 24 Mar 2002 22:37:04 -0800

This will likely open yet another can of worms, but Unicode has been
delayed for too long, I think. It's time to add the Unicode libraries
(in our case, the ICU libraries at <http://oss.software.ibm.com/icu/>,
which Larry has now blessed) to Parrot. string.c already relies on
byte-oriented calls such as isdigit() (admittedly unavoidable while the
library isn't included). So, I have a few thoughts (that may have already been shot
down by people wiser than I in such matters) to explicate, and some
questions to ask.


ICU should be added as a note in the README, and maybe to 'INSTALL' if
we ever create one. Let's not add it to CVS, as it's not under our
control. If we have to patch ICU to make it work correctly with Parrot,
the patches should be submitted back to the ICU team. And I'm joining
the appropriate mailing lists to keep apprised of development.

Before Unicode goes into full swing, I need some idea of how we're going
to deploy the libraries. On this note, I defer to the Configure master,
Brent. I've already done some work with ICU, so I'm reasonably
comfortable with migrating in one Unicode bit at a time, until we're
ready for full UTF-16 compliance.

The RE engine should (I'm speaking without having recently read the
source, so feel free to correct me) not need to be migrated, as it's
already using UTF-32 internally, which leaves just the string internals.
These can be migrated to using ICU macros fairly easily (I've already
done some of the work locally), so I think the main focus should be on
encodings, as we'll eventually have to support the more common
legacy encodings such as KOI-8 and BIG5.

I still have some questions about using UTF-16 internally for string
representation (as mentioned in
<http:[EMAIL PROTECTED]/msg07856.html>),
but I've resolved most of those. UTF-16 is an excellent match for ICU,
which uses UTF-16 internally. My only question is whether we're
going to incur a performance hit every time a scalar is transferred to
the RE engine, as it uses UTF-32 internally.
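
To make the cost of that hand-off concrete, here is a minimal sketch of the
conversion it implies: one pass over the UTF-16 code units, merging surrogate
pairs into single UTF-32 code points. The function name utf16_to_utf32 and its
signature are hypothetical, not Parrot or ICU API, and replacing unpaired
surrogates with U+FFFD is an assumption about error handling.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not Parrot or ICU API): decode `len` UTF-16 code
 * units from `src` into UTF-32 code points in `dst`. `dst` must have room
 * for `len` entries, the worst case when no surrogate pairs are present.
 * Returns the number of code points written. */
size_t utf16_to_utf32(const uint16_t *src, size_t len, uint32_t *dst)
{
    size_t i = 0, out = 0;

    assert(src != NULL && dst != NULL);
    while (i < len) {
        uint32_t unit = src[i++];

        if (unit >= 0xD800 && unit <= 0xDBFF && i < len
            && src[i] >= 0xDC00 && src[i] <= 0xDFFF) {
            /* High surrogate followed by low surrogate: one code point. */
            uint32_t low = src[i++];
            dst[out++] = 0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00);
        } else if (unit >= 0xD800 && unit <= 0xDFFF) {
            /* Unpaired surrogate: substitute U+FFFD (an assumption). */
            dst[out++] = 0xFFFD;
        } else {
            /* Plain BMP code unit maps straight across. */
            dst[out++] = unit;
        }
    }
    return out;
}
```

The work is linear in the string length plus a buffer of up to four bytes per
code unit, so whether it matters depends on how often the same scalar crosses
that boundary.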

Also, once we have UTF-16 running internally, I'd be interested in
seeing what memory consumption looks like vs. UTF-32, because I'd like to
see if it makes sense to add a compile-time switch between UTF-8 and
UTF-32 to let the installer decide on memory tradeoffs. ICU has an
internal macro that defines its own internal representation, and that
could conflict with our intended usage as well.

Performance would suffer in the UTF-8 case, naturally, but the
difference in memory usage might be significant enough that we'd want to
leave the decision up to the installer. Having said that, the headache
of testing multiple versions of Perl6 might not be worth it.
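
As a rough way to estimate that tradeoff before touching the build, the sketch
below counts how many bytes a sequence of code points would occupy in each
encoding. The struct encoded_sizes and the function measure_encodings are
hypothetical names for illustration, and the figures cover only the string
payload, not any per-string header Parrot might add.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result type: payload bytes needed in each encoding. */
typedef struct {
    size_t utf8;
    size_t utf16;
    size_t utf32;
} encoded_sizes;

/* Hypothetical helper: sum the encoded size of `n` code points.
 * Assumes every entry is a valid scalar value (<= 0x10FFFF, no surrogates). */
encoded_sizes measure_encodings(const uint32_t *cps, size_t n)
{
    encoded_sizes s = { 0, 0, 0 };
    size_t i;

    assert(cps != NULL || n == 0);
    for (i = 0; i < n; i++) {
        uint32_t c = cps[i];

        /* UTF-8: 1 to 4 bytes depending on the code point's magnitude. */
        s.utf8 += c < 0x80 ? 1 : c < 0x800 ? 2 : c < 0x10000 ? 3 : 4;
        /* UTF-16: 2 bytes in the BMP, 4 bytes for a surrogate pair. */
        s.utf16 += c < 0x10000 ? 2 : 4;
        /* UTF-32: always 4 bytes. */
        s.utf32 += 4;
    }
    return s;
}
```

Pure ASCII text comes out at 1, 2 and 4 bytes per character respectively,
while most CJK text takes 3 bytes in UTF-8 but only 2 in UTF-16, so the answer
depends heavily on what data is actually being stored.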

So, to wrap up, I'm soliciting thoughts on how best to start the Unicode
migration, and deal with the inevitable problems that will come up. I'm
hoping that most of the magic will be hidden in string.c, where we won't
have to worry about it, but we'll have to see.

Now, this is admittedly being composed at 2:00 A.M., so my thoughts may
not be the most coherent, and for that I apologize. Most of my concern
stems from how best to add build steps to the various platforms without
ending up with a completely broken Parrot for weeks and developers
screaming about "What the *HELL* is this error? Where is this library?
<brane explodes>". If these issues have already been beaten to death and
we've moved on to more interesting issues, of course I'll be interested
there as well.

Waiting for the firestorm,
--
Der Parrot Kommisar, Jeff <[EMAIL PROTECTED]>
