On Thu, 11 Nov 2010 22:40:47 +0100
Daniel Gibson <[email protected]> wrote:

> spir wrote:
> > On Thu, 11 Nov 2010 09:40:05 -0800
> > Andrei Alexandrescu <[email protected]> wrote:
> > 
> >>> string substring(string s, size_t beg, size_t end) // "logical slice" -
> >>> from code point number beg to code point number end  
> >> That's not implemented and I don't think it would be useful. Usually 
> >> when I want a substring, the calculations up to that point indicate the 
> >> code _unit_ I'm at.
> > 
> > Yes, but a code unit does not represent a character; even a full code
> > point only represents a Unicode "abstract character", not a
> > user-perceived one.
> > 
> > import std.stdio, std.string;
> > 
> > void main() {
> >     dstring s = "\u0061\u0302\u006d\u0065"d; // 'a' + combining circumflex + "me"
> >     writeln(s);                   // "âme"
> >     assert(s[0..1] == "a");       // slicing strips the combining mark
> >     assert(s.indexOf("â") == -1); // the precomposed "â" is not found
> > }
> > 
> > A "user-perceived character" (also strangely called "grapheme" in unicode 
> > docs) can be represented by an arbitrary number of code _units_ (up to 8 in 
> > their test data, but there is no actual limit). What a code unit represents 
> > is, say, a "scripting mark". In "â", there are 2 of them. For legacy 
> > reasons, UCS also includes "precombined characters", so that "â" can also 
> > be represented by a single code, indeed. But the above form is valid, it's 
> > even arguably the base form for "â" (and most composite chars cannot be 
> > represented by a single code).
> > 
> 
> OMG, this is worse than I thought O_O
> I thought "ok, for UTF-8 one code unit is one byte and one 'real', visible 
> character is called a code point and consists of 1-4 code units" - but having 
> "user-perceived characters" that consist of multiple code units is sick.

Most people, even programmers who deal with Unicode every day, think the same.
This is due to several factors:
(1) Unicode's misleading use of "abstract character" (one wonders whether it
    was done on purpose);
(2) string processing tools simply ignore all of this;
(3) most texts we deal with today hold only common characters that have a
    single-code-point representation.
So everybody plays with strings as if (1 code point <--> 1 char).
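
To make the distinction concrete, here is a minimal D sketch showing the
three counts diverging. (It uses std.uni.byGrapheme, which modern Phobos
provides; treat its availability as an assumption.)

import std.stdio;
import std.range : walkLength;
import std.uni : byGrapheme;

void main() {
    string s = "a\u0302me";           // "âme" in UTF-8
    writeln(s.length);                // 5 code units (bytes)
    writeln(s.walkLength);            // 4 code points
    writeln(s.byGrapheme.walkLength); // 3 user-perceived characters
}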

> Unicode has a way to tell if a sequence of code units (bytes) belongs
> together or not, so identifying code points isn't too hard.
> But is there a way to identify "graphemes"? Other than a list of rules
> like "a sequence of the two code points <foo> and <bar> makes up one
> grapheme <foobar>"?
 
There is an algorithm, indeed, and it is not too complicated: Unicode's
grapheme cluster boundary rules (UAX #29). But you won't find any information
in the string of code points itself that tells you where the boundaries are
(meaning, you cannot synchronise at the start or end of a "grapheme" without
implementing the whole algorithm).
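
Modern Phobos implements these boundary rules in std.uni.byGrapheme; a small
sketch (again assuming that API is available):

import std.conv : to;
import std.stdio;
import std.uni : byGrapheme;

void main() {
    dstring s = "\u0061\u0302\u006d\u0065"d; // "âme": 4 code points
    // UAX #29 groups them into 3 user-perceived characters:
    // "â" (2 code points), "m", "e".
    foreach (g; s.byGrapheme)
        writefln("%s  (%s code point(s))", g[].to!string, g.length);
}
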
Accordingly, when picking _some_ code point (eg the 'a' above), there is no
way to tell whether it is a standalone code point that happens to represent a
whole character ("a") or just the start of one. Base characters have the same
code whether they stand alone as a whole char or start a "stack" (a substring
representing a whole char). But combining marks have 2 codes: one for the
combining form, one for the standalone form, as in "in Portuguese, '~' is
used to denote a nasal vowel".
(Hope I'm clear.)
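
To illustrate that last point (std.uni.isMark is in Phobos; hedged as above
regarding this thread's date):

import std.stdio;
import std.uni : isMark;

void main() {
    // The tilde exists twice in UCS: as a standalone (spacing) character
    // and as a combining mark that attaches to the preceding code point.
    assert(!isMark('\u007E')); // '~' used alone, as in the Portuguese remark
    assert(isMark('\u0303'));  // COMBINING TILDE, as in "ã" = 'a' + '\u0303'
    writeln("a\u0303"); // renders as one character: "ã"
    writeln("a\u007E"); // renders as two characters: "a~"
}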

The whole set of issues with UCS (the character set), beyond Unicode-proper
ones, is imo:
1. Actual characters are represented by an arbitrary number of code points.
2. The same character can be represented by different sequences of code
   points...
3. including sequences of the same length, just in a different order ;-)
The first issue is actually good: it would be stupid, and indeed impossible,
to try to give a code to every possible combination. Also, the present scheme
allows _creating_ characters for our own use that will be rendered correctly
(yes!). (But I would kill any designer colleague who would allow for points
2 and 3 ;-)
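
A minimal D sketch of points 2 and 3, using std.uni.normalize from modern
Phobos (again an assumption relative to this thread):

import std.stdio;
import std.uni : normalize, NFC, NFD;

void main() {
    // Point 2: one character, two different code sequences.
    string precomposed = "\u00E2";       // "â" as a single precomposed code point
    string decomposed  = "\u0061\u0302"; // "â" as 'a' + combining circumflex
    assert(precomposed != decomposed);   // naive comparison sees two strings
    assert(normalize!NFC(decomposed) == precomposed);

    // Point 3: the same two combining marks, in either order.
    string ab = "a\u0323\u0302"; // 'a' + dot below + circumflex
    string ba = "a\u0302\u0323"; // 'a' + circumflex + dot below
    assert(ab != ba);
    assert(normalize!NFD(ab) == normalize!NFD(ba)); // canonical ordering fixes it
    writeln("equal after normalization");
}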

Denis
-- -- -- -- -- -- --
vit esse estrany ☣

spir.wikidot.com
