> Ok. Now when the identical string "i" (but originating from different
> locale environments) goes through a sequence of string operations later,
> how do you track the locale down to the final C<uc> where it's needed?
>
> e.g.
>
> use German;
> my $gi = "i";
> use Turkish;
> my $ti = "i";

$gi and $ti contain the same Unicode code points, in this case 0x69.

> my $s = $gi x 10;
> ...
> print uc($s); # locale is what?

Locale is what *you* said the level 3 locale should be. If it's not
set, it's probably according to the Unicode default casing rules, which
are language-neutral.

> Where do you track the locale, if not in the string itself.

You don't track it. It's lexical, a policy in that code block.

>>Hmm? The point is that if you have a list of strings, for instance some
>>in English, some in Greek, and some in Japanese, and you want to sort
>>them, then you have to pick a sort ordering.
>
>
> Ok. I want to uppercase the strings - no sorting (yet). I've an array of
> Vienna's Kebab booths. Half of these have Turkish names (at least) the

Mmmm, kebab.

> rest is a mixture of other languages. I'd like to uppercase this array
> of names. How do I do it?

You pick a locale and you say uc(). You can't have *BOTH* Turkish and
German casing rules in effect at the same time. Well, sometimes you
might get away with mixing policies, but in the general case it cannot
work. It may not even make sense (casing is meaningless for many Asian
scripts), or it may be devilishly complex ("Japanese" mixes several
different "scripts" and "languages"). Take www.yahoo.co.jp: what
"language" are the "Yahoo!" strings in?

Let's throw in some more: Vienna beer houses with German names, Vienna
cafes with German names, Vienna cafes with French names, Vienna kebab
houses with Turkish names, Vienna Chinese restaurants, and Vienna Thai
restaurants. Now you want to sort them. Are you going to implement
6x5, or 30, sorting algorithms?

> OTOH normalizing all strings on input is not possible - what if they
> should go into a file in unnormalized form.

Please study the ACR-CCS-CEF-CES mantra. You say "unnormalized form"
without specifying what form you mean.
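
To make the "which form?" question concrete, here is a rough sketch.
It is in Python only because that is quick to run; the helper name
to_form is made up for illustration and is not a Parrot or Perl API.
It shows the same abstract text existing as two different code point
sequences, which serialize to different bytes, so "unnormalized" has
to mean "exactly the sequence that arrived", and that is only one of
several possible forms.

```python
import unicodedata

def to_form(text: str, form: str) -> str:
    """Return text in the named Unicode normalization form
    (one of NFC, NFD, NFKC, NFKD). Hypothetical helper for illustration."""
    return unicodedata.normalize(form, text)

composed = "\u00e9"      # e-acute as a single code point
decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT

# Same abstract text, different code point sequences.
print(composed == decomposed)                   # False
print(to_form(decomposed, "NFC") == composed)   # True
print(to_form(composed, "NFD") == decomposed)   # True

# The serialized bytes (UTF-8 chosen here as an example CES) differ too.
print(composed.encode("utf-8"), decomposed.encode("utf-8"))
```

Whether a file should hold the composed or the decomposed sequence is
a choice somebody has to state; neither one is "the" unnormalized form.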
If you, e.g., really want the bytes of the serialized input file/stream
(a CES), mark your PIO stream as "bytes" and read it in, and then you
can operate on it at level zero.

In PASM, we need a way to say:

    string_level_0
    string_level_1
    string_level_2
    string_level_3(locale)

The string_level_2 *might* have an argument of which Unicode
normalization scheme should be picked, or we might just punt and pick
one as the default.
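
Roughly, and only as a sketch: the four levels could be modeled like
this. This is Python, not PASM, and every name below is hypothetical.
I am assuming level 1 means "decoded code points, nothing more"; the
Turkish override table is a hand-written two-entry toy, not a complete
set of special-casing rules.

```python
import unicodedata
from typing import Optional

def level_0(data: bytes) -> bytes:
    """Level 0: the raw bytes of the serialized stream, untouched."""
    return data

def level_1(data: bytes, encoding: str = "utf-8") -> str:
    """Level 1 (assumed): decode to code points, no normalization."""
    return data.decode(encoding)

def level_2(text: str, form: str = "NFC") -> str:
    """Level 2: code points in a chosen normalization form.
    NFC as the default is an arbitrary pick for this sketch."""
    return unicodedata.normalize(form, text)

# Toy per-locale uppercase overrides (hypothetical, Turkish only).
_UPPER_OVERRIDES = {
    "tr": {"i": "\u0130", "\u0131": "I"},  # i -> dotted capital I, dotless i -> I
}

def level_3_uc(text: str, locale: Optional[str] = None) -> str:
    """Level 3: uppercase under the caller's locale policy.
    With no locale, fall back to the language-neutral default casing."""
    table = _UPPER_OVERRIDES.get(locale, {})
    return "".join(table.get(ch, ch.upper()) for ch in text)

s = "i" * 10                      # the string carries no locale at all
print(level_3_uc(s))              # default casing rules
print(level_3_uc(s, locale="tr")) # Turkish policy, chosen by the caller
```

The string never remembers where it came from. The same ten code
points come out differently only because the caller named a different
policy at the point of the uc().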