Re: ICU incorporation and string changes heads-up

Jarkko Hietaniemi Sat, 10 Apr 2004 04:30:24 -0700

> Ok. Now when the identical string "i" (but originating from different
> locale environments) goes through a sequence of string operations later,
> how do you track the locale down to the final C<uc> where it's needed?
> 
> e.g.
> 
>     use German;
>     my $gi = "i";
>     use Turkish;
>     my $ti = "i";


$gi and $ti contain the same Unicode code points, in this case 0x69.

>     my $s = $gi x 10;
>     ...
>     print uc($s);       # locale is what?

The locale is whatever *you* said the level 3 locale should be.  If it's
not set, casing probably follows the Unicode default casing rules, which
are language-neutral.

> Where do you track the locale, if not in the string itself.

You don't track it.  It's lexical, a policy in that code block.
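
To make the "policy, not string property" point concrete, here is a minimal sketch in Python rather than PASM.  The uc() helper, its locale argument, and the tiny Turkish mapping table are all hypothetical, invented for illustration; this is not a proposed Parrot API, and real Turkish casing has more cases than this.

```python
# Hypothetical sketch: the casing rules are chosen by the calling code,
# never recorded in the string itself.
TURKISH_UPPER = {"i": "\u0130", "\u0131": "I"}  # i -> dotted capital I, dotless i -> I

def uc(s, locale=None):
    """Uppercase s under a locale policy (None means Unicode default rules)."""
    if locale in ("tr", "az"):
        # Apply the language-specific exceptions first, then the default rules.
        s = "".join(TURKISH_UPPER.get(ch, ch) for ch in s)
    return s.upper()

gi = "i"  # written in "German" code
ti = "i"  # written in "Turkish" code
assert gi == ti  # same code point, U+0069; no locale is attached to either

s = gi * 10
default_uc = uc(s)               # language-neutral default casing
turkish_uc = uc(s, locale="tr")  # the caller asked for Turkish rules
```

Both results come from the same input string; only the policy passed by the caller differs.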

>>Hmm? The point is that if you have a list of strings, for instance some
>>in English, some in Greek, and some in Japanese, and you want to sort
>>them, then you have to pick a sort ordering.
> 
> 
> Ok. I want to uppercase the strings - no sorting (yet). I've an array of
> Vienna's Kebab booths. Half of these have Turkish names (at least), the

Mmmm, kebab.

> rest is a mixture of other languages. I'd like to uppercase this array
> of names. How do I do it?

You pick a locale and you say uc().

You can't have *BOTH* Turkish and German casing rules in effect at the
same time.  Well, sometimes you might get away with mixing policies, but
in the general case it cannot work.  It may not even make sense (casing
is meaningless for many Asian scripts), or it may be devilishly complex
("Japanese" mixes several different "scripts" and "languages").  Take
www.yahoo.co.jp: what "language" are the "Yahoo!" strings in?

Let's throw in some more: Vienna beer houses with German names, Vienna
cafes with German names, Vienna cafes with French names, Vienna kebab
houses with Turkish names, Vienna Chinese restaurants, and Vienna Thai
restaurants.  Now you want to sort them.  Are you going to implement 6x5
or 30 sorting algorithms?
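
A rough sketch of "pick one ordering and apply it to the whole list", in Python.  The sort_key() function and the restaurant names are made up for illustration, and the key (decompose, drop combining marks, casefold) is only a crude stand-in for a real collation such as the Unicode Collation Algorithm.

```python
import unicodedata

def sort_key(name):
    """One arbitrary, language-neutral ordering policy for every string."""
    decomposed = unicodedata.normalize("NFKD", name)
    # Drop combining marks so accented letters sort next to their base letter.
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()

# Invented sample data, mixing names from several languages.
names = ["Zum Bären", "Café Central", "Ali Kebap",
         "Çiçek Döner", "Brasserie Étoile"]

ordered = sorted(names, key=sort_key)
```

The point is only that a single policy is applied across the list, whatever language each name happens to be in.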

> OTOH normalizing all strings on input is not possible - what if they
> should go into a file in unnormalized form.

Please study the ACR-CCS-CEF-CES mantra.  You say "unnormalized form"
without specifying what form you mean.  If you, e.g., really want the
bytes of the serialized input file/stream (a CES), mark your PIO stream
as "bytes" and read it in, and then you can operate on it at level zero.

In PASM, we need a way to say:

string_level_0
string_level_1
string_level_2
string_level_3(locale)

The string_level_2 *might* have an argument of which Unicode
normalization scheme should be picked, or we might just punt and pick
one as the default.
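
For illustration only, the first three levels can be approximated in Python.  Treating level 1 as "decoded code points" is my assumption here; level zero as raw bytes and level 2 as a chosen normalization follow the text above.  The sample string is made up, and level 3 (locale) is left out.

```python
import unicodedata

# Serialized input as it might arrive from a stream marked "bytes".
# "o" followed by U+0308 (combining diaeresis) is deliberately unnormalized.
raw = "Caf\u00e9 Zo\u0308e".encode("utf-8")

level0 = raw                                   # level 0: the bytes, untouched
level1 = raw.decode("utf-8")                   # level 1 (assumed): code points
level2 = unicodedata.normalize("NFC", level1)  # level 2: one normalization form

# The same text has a different length at each level.
lengths = (len(level0), len(level1), len(level2))
```

Writing level0 back out reproduces the input file exactly, which is what you want when the "unnormalized form" must survive a round trip.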
