On Thu, 11 Nov 2010 23:17:05 +0000 (UTC)
retard wrote:
> Thu, 11 Nov 2010 23:59:36 +0100, spir wrote:
>
> > (3) most texts we deal with
> > today only hold common characters that have a single-code
> > representation. So that everybody plays with strings as if (1 code <-->
> > 1 char).
>
> That might be true for many americans. But even then the single byte
> can't express all characters you need in everyday communication. There
> are countless people with é or ë or ü in their last name. ” and “ are
> probably not among the first 128-256 codes. Using e instead of ë or é
> might work to some extent, but ü and u are pronounced differently. Some
> use ue instead.
I meant _codes_ (code points), not code _units_, and even less bytes.
The character <I with dot above and dot below> (if ever you want to use it ;-)
needs 2 or 3 code _points_ for representation in memory or storage. Try:
writeln ("\u0049\u0307\u0323"d); // --> Ị̇
If your output system is sufficiently capable, then you get an I with dot above
and dot below! (I recommend the DejaVu font series.) And, as you see, the type
dstring is used, meaning each element is a dchar holding a whole code point.
Right? But it's a single character requiring 3 codes.
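
To make the gap between "codes" and "characters" concrete, here is a minimal sketch. It assumes a current Phobos whose std.uni provides byGrapheme; the helper names codePointCount and graphemeCount are made up for illustration. It counts the same dstring once per code point and once per user-perceived character.

```d
import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

/// Number of code points: each element of a dstring is one dchar.
/// (Hypothetical helper, named for this example only.)
size_t codePointCount(dstring text)
{
    return text.length;
}

/// Number of user-perceived characters (grapheme clusters).
/// (Hypothetical helper, named for this example only.)
size_t graphemeCount(dstring text)
{
    return text.byGrapheme.walkLength;
}

void main()
{
    // 'I' + combining dot above + combining dot below
    dstring s = "\u0049\u0307\u0323"d;

    writeln(codePointCount(s)); // 3
    writeln(graphemeCount(s));  // 1
}
```

Indexing or slicing the dstring still works per code point, so s[0 .. 1] would cut the 'I' away from its two dots.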
Even more troubling: if I choose a lowercase 'i' instead, then since <i with
dot below> exists as a precomposed code, I have the choice between 2 or 3 codes.
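
A small sketch of that choice, assuming std.uni.normalize from a current Phobos (canonicallyEqual is a hypothetical helper written for this example): the 2-code and 3-code spellings differ element by element, yet they are the same text once both are brought to a common normalization form.

```d
import std.stdio : writeln;
import std.uni : NFC, NFD, normalize;

/// True when two strings have the same fully decomposed (NFD) form,
/// i.e. they are canonically equivalent.
/// (Hypothetical helper, named for this example only.)
bool canonicallyEqual(dstring a, dstring b)
{
    return normalize!NFD(a) == normalize!NFD(b);
}

void main()
{
    dstring decomposed  = "\u0069\u0323\u0307"d; // i + dot below + dot above
    dstring precomposed = "\u1ECB\u0307"d;       // <i with dot below> + dot above

    writeln(decomposed.length);                         // 3
    writeln(precomposed.length);                        // 2
    writeln(decomposed == precomposed);                 // false
    writeln(canonicallyEqual(decomposed, precomposed)); // true

    // NFC composes whatever has a precomposed code, giving the 2-code form.
    writeln(normalize!NFC(decomposed) == precomposed);  // true
}
```

A plain == on two dstrings therefore answers "same codes?", not "same characters?".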
An "abstract character", as introduced by UCS and represented by a code, is
*not* what we think of as a "character". It is an abstract "mark", such as the
'I', the combining dot above, the combining dot below, all inside "Ị̇".
Also, it's important to realise that there is no formal definition of
"character", and even less a universal one. A character is what people using a
writing system consider as such.
I know, UCS / Unicode terminology is misleading. It does not help, instead it
increases confusion.
What you are referring to is a lower-level issue, namely the encoding of the
code points themselves (here, 3) into code units, and then bytes, in a concrete
form (say, in a file). Depending on the encoding (here I consider only
utf8/16/32), there may be 1 to 4, 1 or 2, or exactly 1 code unit(s) per code
point, respectively.
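
For illustration, a short sketch of that lower level using std.utf.codeLength. The euro sign and the musical G clef are extra sample code points chosen here, not taken from the thread. It prints how many code units each code point needs per encoding, then the total length of the 3-code-point example in each string type.

```d
import std.stdio : writefln, writeln;
import std.utf : codeLength;

void main()
{
    // 'I', combining dot above, euro sign, musical G clef (outside the BMP)
    foreach (dchar c; "\u0049\u0307\u20AC\U0001D11E"d)
    {
        writefln("U+%04X: utf8=%s utf16=%s utf32=%s",
                 cast(uint) c,
                 codeLength!char(c), codeLength!wchar(c), codeLength!dchar(c));
    }

    // The same 3 code points, stored in each of the three encodings.
    string  s8  = "\u0049\u0307\u0323";
    wstring s16 = "\u0049\u0307\u0323"w;
    dstring s32 = "\u0049\u0307\u0323"d;

    writeln(s8.length);  // 5 code units (1 + 2 + 2)
    writeln(s16.length); // 3 code units
    writeln(s32.length); // 3 code units
}
```

Whatever the encoding, all of these lengths describe one single character on screen.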
Denis
-- -- -- -- -- -- --
vit esse estrany ☣
spir.wikidot.com