I have to say I'm slightly confused too about some writing systems,
especially syllabic alphabets. At the same time, I'm pretty clear on CJK,
syllabaries, and alphabets, or at least I hope I'm clear (I guess I'm
about to find out): .chars just returns characters at whatever Unicode
level the string's contents require.

"abc".chars would return <a b c>, each of which I'm guessing would
usually be one byte in size.

"日本語".chars would return <日 本 語>, which can probably be
expressed with UTF-8?
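
To make the byte-size guess concrete, here is a rough sketch in Python (not Perl 6, and not a claim about how .chars is or will be implemented). The sample strings and the helper name utf8_widths are my own, picked only for illustration. It splits a string into code points and reports how many bytes each one takes in UTF-8:

```python
# Sample strings chosen for illustration only.
samples = ["abc", "日本語"]

def utf8_widths(s):
    """Return (character, UTF-8 byte count) pairs, one per code point."""
    return [(ch, len(ch.encode("utf-8"))) for ch in s]

for s in samples:
    # list(s) splits on code points, which is the closest Python
    # analogue to the list-context .chars discussed here.
    print(list(s), utf8_widths(s))
```

The ASCII letters come out at one byte each and the CJK characters at three bytes each, so both are expressible in UTF-8, just at different widths.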

> Aaron wrote:
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ﬃ" back when you say
> "ﬃ".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get a shadda and a damma from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?
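
For what it's worth, Unicode itself records how these ligatures break apart, through compatibility decomposition. The sketch below uses Python's standard unicodedata module purely as an illustration; whether .chars should ever apply such a decomposition is exactly the open question, and the helper name compat_decompose is my own invention:

```python
import unicodedata

def compat_decompose(s):
    """Split a string into code points after compatibility decomposition (NFKD)."""
    return list(unicodedata.normalize("NFKD", s))

ffi = "\uFB03"           # LATIN SMALL LIGATURE FFI
shadda_damma = "\uFCF3"  # ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM

print(compat_decompose(ffi))  # ['f', 'f', 'i']

# Print the names of the pieces rather than the marks themselves,
# since combining marks are hard to read on their own.
for ch in compat_decompose(shadda_damma):
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))
```

Without normalization, each ligature is a single code point, so a codepoint-level .chars would hand back one element for it.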

From Apocalypse 5: "Under level 2 Unicode support, a character is
assumed to mean a grapheme, that is, a sequence consisting of a base
character followed by 0 or more combining characters."
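
Read literally, that definition can be sketched as below. This is Python rather than Perl 6, and it is a simplified approximation that only follows the "base character plus combining characters" wording; it is not the full Unicode grapheme cluster algorithm, and the function name graphemes is just something I made up for the example:

```python
import unicodedata

def graphemes(s):
    """Group each base character with the combining characters that follow it."""
    clusters = []
    for ch in s:
        # unicodedata.combining() returns a nonzero class for combining marks.
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters

text = "e\u0301a"  # "e" + COMBINING ACUTE ACCENT, then "a"
print(len(text))             # 3 code points
print(len(graphemes(text)))  # 2 graphemes
```

So at the grapheme level the accented "e" counts as one character even though it is stored as two code points.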

Marcus

On 4/11/05, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> On Mon, 2005-04-11 at 14:12, Ingo Blechschmidt wrote:
> 
> > gcomnz wrote:
> > > I'm writing a bunch of examples for perl 6 pleac and it seems rather
> > > natural to expect $string.chars to return a list of unicode chars in
> > > list context, however I can't find anything to confirm that. (The
> > > other alternatives being split and unpack.)
> >
> > I like that.
> 
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ﬃ" back when you say
> "ﬃ".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get a shadda and a damma from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?
> 
> Please be aware, I'm talking about ligatures above, NOT special letters
> such as "æ", which are their own letters, and cannot be decomposed into
> "a", "e" without losing information.
> 
> Given Parrot, what happens when you are presented with a Big5 string
> that does not have a strict Unicode equivalent? Does .chars throw an
> exception, or does it rely on the string to know how to "characterify
> itself" according to its vtable?
> 
> --
> Aaron Sherman <[EMAIL PROTECTED]>
> Senior Systems Engineer and Toolsmith
> "It's the sound of a satellite saying, 'get me down!'" -Shriekback
> 
>
