I have to say I'm slightly confused too for some languages, especially syllabic alphabets. At the same time, I'm pretty clear on CJK, syllabaries, and alphabets, or at least I hope I am (I guess I'm about to find out): .chars just returns the right Unicode level for whatever the string's contents require.
"abc".chars would return <a b c>, each of which I'm guessing would usually be byte-sized. "日本語".chars would return <日 本 語>, which can probably be expressed in UTF-8?

Aaron wrote:
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ﬃ" back when you say
> "ﬃ".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get "ﹽ", "ﹹ" from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?

From Apocalypse 5: "Under level 2 Unicode support, a character is assumed to mean a grapheme, that is, a sequence consisting of a base character followed by 0 or more combining characters."

Marcus

On 4/11/05, Aaron Sherman <[EMAIL PROTECTED]> wrote:
> On Mon, 2005-04-11 at 14:12, Ingo Blechschmidt wrote:
>
> > > gcomnz wrote:
> > > I'm writing a bunch of examples for perl 6 pleac and it seems rather
> > > natural to expect $string.chars to return a list of unicode chars in
> > > list context, however I can't find anything to confirm that. (The
> > > other alternatives being split and unpack.)
> >
> > I like that.
>
> Same here, though I have to admit that I'm slow on this whole Unicode
> thing, so I'm not sure what you mean by "Unicode chars". For example,
> are you expecting to get "f", "f", "i" or "ﬃ" back when you say
> "ﬃ".chars? More interestingly, what about all of the Arabic ligatures
> which someone who speaks that language might reasonably expect to get
> back as multiple "chars", but they have their own Unicode codepoint
> (e.g. ﳳ which is "U+FCF3 ARABIC LIGATURE SHADDA WITH DAMMA MEDIAL FORM"
> which you might expect to get "ﹽ", "ﹹ" from)? Any Arabic speakers to
> confirm or deny this behavior of ligatures?
>
> Please be aware, I'm talking about ligatures above, NOT special letters
> such as "Æ", which are their own letters, and cannot be decomposed into
> "a", "e" without losing information.
>
> Given Parrot, what happens when you are presented with a Big5 string
> that does not have a strict Unicode equivalent? Does .chars throw an
> exception, or does it rely on the string to know how to "characterify
> itself" according to its vtable?
>
> --
> Aaron Sherman <[EMAIL PROTECTED]>
> Senior Systems Engineer and Toolsmith
> "It's the sound of a satellite saying, 'get me down!'" -Shriekback
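
For anyone trying to keep the levels in this thread straight (bytes, codepoints, graphemes), here is a rough sketch in Python rather than Perl 6, since the Perl 6 behaviour is exactly what is still being decided. It is only an illustration, not a statement of what .chars does. The helper name `graphemes` is made up for this example, and it takes the Apocalypse 5 wording literally by treating "combining character" as "nonzero canonical combining class", which is a simplification of real grapheme segmentation.

```python
import unicodedata


def graphemes(s):
    """Group each base character with the combining characters after it.

    Hypothetical helper: 'combining' here means a nonzero canonical
    combining class, which is only an approximation of full grapheme rules.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch      # attach the mark to the previous base
        else:
            clusters.append(ch)     # start a new grapheme
    return clusters


s = "e\u0301a"                       # "e" + COMBINING ACUTE ACCENT, then "a"
byte_count = len(s.encode("utf-8"))  # byte level (UTF-8 encoding)
codepoints = list(s)                 # codepoint level
graphs = graphemes(s)                # grapheme level

# The "ffi" ligature (U+FB03) only splits under compatibility decomposition.
ligature_parts = list(unicodedata.normalize("NFKD", "\uFB03"))

# "Æ" (U+00C6) is its own letter and has no decomposition at all.
ae_parts = list(unicodedata.normalize("NFKD", "\u00C6"))

print(byte_count, len(codepoints), len(graphs))
print(ligature_parts, ae_parts)
```

The same string gives a different count at each level, and the ligature question comes down to whether compatibility decomposition is applied before counting. That is a separate decision from which level .chars works at.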