On Mon, 2005-04-11 at 15:40, gcomnz wrote:
> I have to say I'm slightly confused too for some languages,
> especially for syllabic alphabets. At the same time, I'm pretty clear
> for CJK, syllabaries, and alphabets, or at least I hope I'm clear (I
> guess I'm about to find out): .chars just returns the right Unicode
> level for whatever the string contents require.
> "abc".chars would return <a b c>, which I'm guessing would be
> byte-sized usually.
Fair enough.
> "ææè".chars would return <æãæãè>, which can probably be
> expressed with
> UTF8?
I think you're confusing UTF-8 (which can represent ALL Unicode
characters) with "the UTF-8 subset which consists of one-byte
representations" (which happens to overlap with 7-bit ASCII).
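
To make that distinction concrete, here is a rough sketch in Python
(not Perl 6, whose .chars was still being designed at this point). The
sample strings are my own illustrations, not taken from the thread, and
the code points are escaped so the example survives any mail encoding.
It only shows that the character count and the UTF-8 byte count are
separate things:

```python
# Hypothetical sample strings, chosen only for illustration.
ascii_text = "abc"
cjk_text = "\u65e5\u672c\u8a9e"  # three CJK ideographs

ascii_bytes = ascii_text.encode("utf-8")
cjk_bytes = cjk_text.encode("utf-8")

# Code point count next to encoded byte count.
print(len(ascii_text), len(ascii_bytes))  # 3 3
print(len(cjk_text), len(cjk_bytes))      # 3 9

# For pure 7-bit ASCII text, the UTF-8 bytes are identical to the ASCII bytes.
print(ascii_bytes == ascii_text.encode("ascii"))  # True
```

So both strings have three characters; only the encoded size differs.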
> From Apocalypse 5: "Under level 2 Unicode support, a character
> is assumed to mean a grapheme, that is, a sequence consisting of a
> base character followed by 0 or more combining characters."
> Marcus
Hmmm... that doesn't answer the ligature question clearly though. That
answers for the case of combining diacritical marks:
http://en.wikipedia.org/wiki/Combining_diacritical_mark
e.g. <A Ì> vs "Ã", which is a pre-combined example, but there are (as I
understand it), many valid examples which do not have a pre-combined
representation in Unicode.
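
A small Python sketch of the combining-mark case, under stated
assumptions: the graphemes() helper is hypothetical and only
approximates the "base character followed by 0 or more combining
characters" definition by checking the canonical combining class. It
is not a full Unicode grapheme-cluster implementation (some marks have
class 0 and would be split off), and the sample strings are mine:

```python
import unicodedata

def graphemes(text):
    """Group each base character with the combining marks that follow it.

    Simplification: a character counts as "combining" only if its
    canonical combining class is non-zero.
    """
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch   # attach the mark to the previous base
        else:
            clusters.append(ch)  # start a new cluster
    return clusters

decomposed = "A\u0300"      # A + COMBINING GRAVE ACCENT
precomposed = "\u00c0"      # LATIN CAPITAL LETTER A WITH GRAVE
no_precomposed = "q\u0303"  # q + COMBINING TILDE, no single code point exists

# NFC composes the pair when a pre-combined form exists...
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True
# ...and leaves it as two code points when none does.
print(len(unicodedata.normalize("NFC", no_precomposed)))        # 2

# Either way, the grouping sees one grapheme per base character.
print(len(graphemes(decomposed + "bc")))  # 3
print(len(graphemes(no_precomposed)))     # 1
```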
But not for ligatures:
http://en.wikipedia.org/wiki/Ligature_%28typography%29
which are, by definition, actually two or more unique characters which
have a special typographical representation when adjacent. So, they are
a single grapheme, but like I said: certain cultures would be shocked by
a .chars that did not decompose their ligatures (and again, I'm mostly
thinking Arabic, so I'd defer to someone who actually speaks Arabic and
knows how they deal with this).
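
For the ligature case, one way to see the difference from combining
marks is Unicode compatibility normalization. This is a Python sketch,
not a statement about what .chars should do. It assumes the ligature is
stored as a single presentation-form code point, which is the
exception: ordinary Arabic text is normally stored as separate letters
and joined only at rendering time. The two sample code points are my
own picks:

```python
import unicodedata

fi_ligature = "\ufb01"  # LATIN SMALL LIGATURE FI
lam_alef = "\ufefb"     # ARABIC LIGATURE LAM WITH ALEF ISOLATED FORM

for lig in (fi_ligature, lam_alef):
    nfc = unicodedata.normalize("NFC", lig)    # canonical: ligature stays whole
    nfkd = unicodedata.normalize("NFKD", lig)  # compatibility: split into letters
    print(len(nfc), [f"U+{ord(c):04X}" for c in nfkd])
# 1 ['U+0066', 'U+0069']
# 1 ['U+0644', 'U+0627']
```

Canonical normalization treats each ligature as one character, while
the compatibility form yields the underlying letters, so which answer
.chars gives depends on which level it is asked to work at.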