Re: Character sets PDD ready for review

Leopold Toetsch Fri, 14 Mar 2008 18:00:13 -0700

On Saturday, 8 March 2008 13:59, Simon Cozens wrote:
> Hi folks,
>       I think I've finished doing what I can with
> docs/pdds/draft/pdd28_character_sets.pod for the time being.
>       Please have a look at it, and let me know if there's anything wrong,
> anything unclear, anything missing or anything objectionable about it.
> Character set and encoding support is an absolute nightmare to get
> right, but I feel the stuff in this PDD gives us a good basis to work from.
>       If there's no major problems with it, I'll pass it on to Allison for
> editing.


1) The Parrot internal character type

«Strings in Parrot's native string format will probably be an array of 
"Parrot_Rune"s.»

or iso-8859-1 or UCS-2.

Why: 

iso-8859-1 is a 1-byte charset/encoding whose 256 characters map directly to 
the Unicode codepoints U+0000 - U+00FF. The BIO:: folks on CPAN, and many 
others, will want the speed and memory savings of a 1-byte encoding.
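
A minimal Python sketch of that point (Python is used only for illustration; 
nothing here is Parrot code): decoding every byte value as iso-8859-1 gives 
the codepoint with the same number, and the same characters take a quarter of 
the space of a 32-bit representation.

```python
# Every byte value 0..255, decoded as iso-8859-1, yields the codepoint
# with the same number, so no lookup table is needed.
raw = bytes(range(256))
text = raw.decode("iso-8859-1")
identity = all(ord(ch) == b for ch, b in zip(text, raw))

# Storage cost of the same 256 characters at two fixed widths.
size_1byte = len(text.encode("iso-8859-1"))   # 1 byte per character
size_4byte = len(text.encode("utf-32-le"))    # 4 bytes per character, no BOM

print(identity, size_1byte, size_4byte)
```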

UCS-2 is a fixed-width 16-bit charset that covers the "Basic Multilingual 
Plane" [1] of Unicode. It is sufficient to represent a very high 
percentage of the codepoints in actual use. When Wikipedia [2] states ...

<cite>
UCS-2 (2-byte Universal Character Set) is an obsolete character encoding which 
is a predecessor to UTF-16.
</cite>

..., it's already mixing the concepts of charset and encoding. Anyway, for 
efficiency reasons, I'd like to see this as an alternative.
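
One way to picture the alternative is a per-string choice of the narrowest 
fixed width that still holds every codepoint. The sketch below is in Python 
and is only an illustration under that assumption; the function name 
narrowest_width is hypothetical and not part of Parrot or the PDD.

```python
def narrowest_width(s):
    """Return the smallest fixed width in bytes (1, 2 or 4) that can
    hold every codepoint of s. Hypothetical helper, not a Parrot API."""
    top = max(map(ord, s), default=0)
    if top <= 0xFF:
        return 1      # fits iso-8859-1
    if top <= 0xFFFF:
        return 2      # fits UCS-2 (Basic Multilingual Plane only)
    return 4          # needs a full 32-bit unit

samples = ["Parrot", "M\u00e4rz", "\u01f0", "\U0001F600"]
widths = [narrowest_width(s) for s in samples]
print(widths)
```

Only the last sample, which lies outside the Basic Multilingual Plane, needs 
the 4-byte form.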

2) The concept of Parrot_Rune, or

<cite>
Unicode codepoint where values >= 0x80000000 are understood to be entries into 
the global "Parrot_grapheme_table" array.
</cite>

seems to imply that we are going to:

a) rewrite / improve the ICU library we use now
b) invent a new "standard"
c) do a lot of hiring in the future to keep in sync with the Unicode folks ;-)

Basically, I have some concerns about who will implement and maintain it.
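
To make the quoted idea concrete, here is a rough Python sketch of such a 
table. It is not the PDD's algorithm: the class GraphemeTable, the function 
to_runes, the use of the low 31 bits as a table index, and the simple "base 
character plus following combining marks" clustering are all assumptions 
made for illustration.

```python
import unicodedata

GRAPHEME_FLAG = 0x80000000   # assumption: values at or above this are table entries

class GraphemeTable:
    """Hypothetical stand-in for the PDD's global Parrot_grapheme_table."""
    def __init__(self):
        self.entries = []    # index -> tuple of codepoints
        self.index = {}      # tuple of codepoints -> index

    def intern(self, cluster):
        # Reuse the existing slot if this sequence was seen before.
        if cluster not in self.index:
            self.index[cluster] = len(self.entries)
            self.entries.append(cluster)
        return GRAPHEME_FLAG | self.index[cluster]

def to_runes(s, table):
    """Encode s as one integer per base character plus its combining marks."""
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1].append(ord(ch))   # attach mark to the preceding base
        else:
            clusters.append([ord(ch)])
    # Single codepoints stay as they are; longer sequences go into the table.
    return [c[0] if len(c) == 1 else table.intern(tuple(c)) for c in clusters]

table = GraphemeTable()
runes = to_runes("_j\u030c_", table)
print([hex(r) for r in runes], table.entries)
```

Even this toy version has to decide how to segment and when entries may be 
shared or freed, which is the maintenance burden in question.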

I wrote the one and only (AFAIK) test [3] showing the ugliness of decomposed 
Unicode codepoints [4], and I'd be glad if there were a better solution. 

OTOH, I don't know the impact of not having it. East European folks, or others 
who may be affected, should speak up now.
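
The composed versus decomposed difference can be reproduced outside Parrot 
with a few lines of Python, using the same character that appears in the 
output of [4]. That the 7 and 8 printed there are codepoint counts is an 
assumption; the sketch only shows what the standard normalization forms do.

```python
import unicodedata

s = "___\u01f0___"                            # U+01F0, j with caron, precomposed
composed = unicodedata.normalize("NFC", s)    # keeps the single codepoint
decomposed = unicodedata.normalize("NFD", s)  # splits it into j + U+030C
lengths = (len(composed), len(decomposed))
print(lengths)
```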

> Simon

leo's 2¢

[1] http://en.wikipedia.org/wiki/Basic_Multilingual_Plane
[2] http://en.wikipedia.org/wiki/UTF-16
[3] [EMAIL PROTECTED]:~/svn/parrot/leo> find t -name '*.t' | xargs grep -w compose
t/op/string_cs.t:    compose S1, S1
t/pmc/object-mro.t:# ... now some tests which fail to compose the class
[4] [EMAIL PROTECTED]:~/svn/parrot/leo> ./parrot t/op/string_cs_46.pasm
___ǰ___
7 8 8 7
