On Saturday, 8 March 2008 at 13:59, Simon Cozens wrote:
> Hi folks,
> I think I've finished doing what I can with
> docs/pdds/draft/pdd28_character_sets.pod for the time being.
> Please have a look at it, and let me know if there's anything wrong,
> anything unclear, anything missing or anything objectionable about it.
> Character set and encoding support is an absolute nightmare to get
> right, but I feel the stuff in this PDD gives us a good basis to work from.
> If there's no major problems with it, I'll pass it on to Allison for
> editing.
1) The Parrot internal character type
«Strings in Parrot's native string format will probably be an array of
"Parrot_Rune"s.»
... or an array of iso-8859-1 or UCS-2 characters.
Why:
iso-8859-1 is a 1-byte charset/encoding whose 256 characters match the
Unicode codepoints U+0000 - U+00FF. CPAN's BIO::folks and many others would
like to have the speed and memory benefits of a 1-byte encoding.
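To illustrate the identity mapping mentioned above, here is a minimal Python sketch (not Parrot code). The helper name `fits_latin1` is made up for this example; the point is only that every iso-8859-1 byte value equals the Unicode codepoint it stands for, so widening or narrowing is a plain copy.

```python
# Every ISO-8859-1 byte value is numerically equal to its Unicode codepoint.
latin1_bytes = bytes(range(256))
text = latin1_bytes.decode("latin-1")
assert [ord(ch) for ch in text] == list(range(256))


def fits_latin1(s: str) -> bool:
    """Hypothetical helper: True if every codepoint is in U+0000 - U+00FF."""
    return all(ord(ch) <= 0xFF for ch in s)


print(fits_latin1("caf\u00e9"))  # e with acute is U+00E9, inside the range
print(fits_latin1("\u20ac"))     # the euro sign is U+20AC, outside the range
```

A string that passes this check can be stored at one byte per character without any lookup table.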
UCS-2 is a fixed-width 16-bit charset that covers the "Basic Multilingual
Plane" [1] of Unicode. That is sufficient to represent a very high
percentage of the codepoints in actual use. When Wikipedia [2] states ...
<cite>
UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which
is a predecessor to UTF-16.
</cite>
..., it's already mixing up the concepts of charset and encoding. Anyway, for
efficiency reasons, I'd like to see this as an alternative.
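A rough sketch of the trade-off, in Python rather than Parrot's C: pick the narrowest fixed-width unit that can hold every codepoint in a string. The function names and the 1/2/4-byte tiers are assumptions for illustration, not anything the PDD specifies.

```python
def narrowest_width(s: str) -> int:
    """Hypothetical helper: bytes per character for the narrowest
    fixed-width storage that can hold every codepoint in s."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 1   # fits iso-8859-1
    if top <= 0xFFFF:
        return 2   # fits UCS-2 (Basic Multilingual Plane only)
    return 4       # needs a full 32-bit codepoint


def storage_bytes(s: str) -> int:
    """Bytes needed to store s at its narrowest fixed width."""
    return len(s) * narrowest_width(s)


for sample in ("ACGTACGT", "na\u00efve", "\u20ac100", "\U0001F600"):
    print(repr(sample), narrowest_width(sample), storage_bytes(sample))
```

With 32-bit units only, the eight-character DNA string would take 32 bytes instead of 8.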
2) The concept of Parrot_Rune, or
<cite>
Unicode codepoint where values >= 0x80000000 are
understood to be entries into the global "Parrot_grapheme_table" array.
</cite>
seems to imply that we are going to:
a) rewrite / improve the ICU library we use now
b) invent a new "standard"
c) do a lot of hiring in the future to keep in sync with the Unicode folks ;-)
Basically, my concern is: who will implement and maintain it?
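For readers who have not seen the PDD, this is roughly how I read the quoted idea, sketched in Python. Only the 0x80000000 threshold and the table name come from the quote; the interning scheme, the index arithmetic and the function names are my assumptions.

```python
GRAPHEME_BASE = 0x80000000  # threshold taken from the quoted PDD text

# Stand-in for the global "Parrot_grapheme_table"; layout is assumed.
grapheme_table = []
_grapheme_index = {}


def intern_grapheme(codepoints):
    """Return a rune >= GRAPHEME_BASE naming this codepoint sequence."""
    key = tuple(codepoints)
    if key not in _grapheme_index:
        _grapheme_index[key] = len(grapheme_table)
        grapheme_table.append(key)
    return GRAPHEME_BASE + _grapheme_index[key]


def expand_rune(rune):
    """Map a rune back to the codepoint sequence it stands for."""
    if rune >= GRAPHEME_BASE:
        return grapheme_table[rune - GRAPHEME_BASE]
    return (rune,)  # an ordinary Unicode codepoint


j_caron = intern_grapheme([0x006A, 0x030C])  # "j" + COMBINING CARON
print(hex(j_caron), expand_rune(j_caron), expand_rune(0x41))
```

Even this toy version needs its own table management, which is where the maintenance question comes from.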
I wrote the one and only (AFAIK) test [3] showing the ugliness of decomposed
Unicode codepoints [4], and I'd be glad if there were a better solution.
OTOH I don't know the impact of not having it. East European folks, or others
who may be affected, should speak up now.
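The effect can be reproduced outside Parrot with Python's standard `unicodedata` module. This is only an illustration of composed versus decomposed lengths, not a port of the PASM test; the string is the one printed in footnote [4].

```python
import unicodedata

composed = "___\u01F0___"  # U+01F0 LATIN SMALL LETTER J WITH CARON
decomposed = unicodedata.normalize("NFD", composed)    # "j" + U+030C
recomposed = unicodedata.normalize("NFC", decomposed)  # back to U+01F0

# Same text for the reader, different number of codepoints.
print(len(composed), len(decomposed), len(recomposed))  # prints: 7 8 7
print([hex(ord(ch)) for ch in decomposed[3:5]])
```

Any length or index operation that counts codepoints gives different answers for the two forms of what the user sees as the same string.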
> Simon
leo's 2¢
[1] http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
[2] http://en.wikipedia.org/wiki/UTF-16
[3] [EMAIL PROTECTED]:~/svn/parrot/leo> find t -name '*.t' | xargs grep -w compose
t/op/string_cs.t: compose S1, S1
t/pmc/object-mro.t:# ... now some tests which fail to compose the class
[4] [EMAIL PROTECTED]:~/svn/parrot/leo> ./parrot t/op/string_cs_46.pasm
___ǰ___
7 8 8 7