Re: [DUG] Upgrading to XE - Unicode strings questions

Stefan Mueller Tue, 23 Nov 2010 07:29:10 -0800

John,

I think you are confusing the canonical & normalized versions of the same 
Unicode string (in the example s1 is canonical, i.e. decomposed; s2 is 
normalized, i.e. precomposed) with the effect of local codepage conversion.


Windows-1252 codepage (Western European, closely related to Latin-1 / ISO 
8859-1) has support for characters like the "ö" (character code #246) and "é" 
(character code #233). Converting to ansistring/ansichar on your system will 
take care of the canonical Unicode representation and hence return true if you 
compare those strings. Please note that this only works because your system is 
set to a Latin-based codepage ... do the same on a Japanese version of Windows 
and you'll get a very different result, as there is no support for "ö" in 
ansistring under the Japanese codepage! Because your system is Latin, your 
first testcase/example of building the word "fiancee" should actually work 
without problems - Jolyon/Cary are probably wrong if they indeed implied that 
this wouldn't work.
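
A minimal sketch of the codepage dependence described above, assuming Delphi 
2009 or later (code-page-tagged AnsiString types). The type names are made up 
for this illustration, and the exact replacement character produced under code 
page 932 is not asserted here; the only point is that the round trip is lossy.

```delphi
program CodepageRoundTrip;
{$APPTYPE CONSOLE}

type
  // Hypothetical names for this sketch; the AnsiString(N) syntax itself is
  // the standard Delphi 2009+ way to tag a string type with a code page.
  Latin1252String = type AnsiString(1252);   // Western European
  Japanese932String = type AnsiString(932);  // Japanese (Shift-JIS)

var
  Original, ViaLatin, ViaJapanese: UnicodeString;
  L: Latin1252String;
  J: Japanese932String;
begin
  // Four-digit #$xxxx literals are parsed as UTF-16 code units.
  Original := 'Hell'#$00F6;   // ends in precomposed "ö"

  // Plain assignments convert between encodings; the compiler emits
  // implicit string cast warnings (W1057/W1058) for them.
  L := Original;              // Unicode -> code page 1252
  ViaLatin := L;              // and back to Unicode

  J := Original;              // Unicode -> code page 932
  ViaJapanese := J;           // and back to Unicode

  // Code page 1252 contains "ö", so this round trip should preserve the text.
  Writeln('1252 round trip equal: ', ViaLatin = Original);
  // Code page 932 has no "ö", so the text is expected to come back changed.
  Writeln('932 round trip equal:  ', ViaJapanese = Original);
end.
```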

The "ö" can be written as a compound #$006F + #$0308 in canonical (decomposed, 
NFD) format ... and as #$00f6 in the "normalized" (precomposed, NFC) format. 
For most normal applications it just doesn't really matter either way, because 
a user who is inputting text under his local codepage will always do it the 
same way, and hence the chances of you encountering a mix of 
canonical/normalized versions will be close to zero. You only ever get issues 
if you cross codepage boundaries (for example, if you have users in different 
countries storing data in a database - which is why international databases 
often use UTF-8 to store data instead of their native character sets). Most of 
the better databases (Oracle, for example) have built-in support for sorting 
and handling canonical format and do the conversion automatically for you ... 
for someone writing desktop applications it usually just isn't an issue either 
way. 
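
If you ever do need to compare mixed forms, one option is to bring both strings 
to the same form first. The sketch below is only an illustration, not tested 
production code: it assumes the Windows NormalizeString API from Normaliz.dll 
(Windows Vista and later), imported by hand, and ToNFC is a hypothetical helper 
name that does not exist in the RTL.

```delphi
program NormalizeDemo;
{$APPTYPE CONSOLE}

const
  NormalizationC = 1;  // NORM_FORM value for NFC (precomposed form)

// Hand-written import; assumes Normaliz.dll is present (Vista and later).
// Because the import is static, the program will not load without the DLL.
function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall;
  external 'Normaliz.dll' name 'NormalizeString';

// Hypothetical helper: returns S in NFC, or S unchanged if the API call fails.
function ToNFC(const S: UnicodeString): UnicodeString;
var
  Needed, Written: Integer;
begin
  Result := S;
  if S = '' then
    Exit;
  // Passing nil/0 as the destination asks for an estimated buffer length.
  Needed := NormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Needed <= 0 then
    Exit;
  SetLength(Result, Needed);
  Written := NormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Needed);
  if Written > 0 then
    SetLength(Result, Written)  // trim the estimate to the actual length
  else
    Result := S;                // fall back to the input on failure
end;

var
  S1, S2: UnicodeString;
begin
  S1 := 'Hell'#$006F#$0308;  // "o" followed by a combining diaeresis
  S2 := 'Hell'#$00F6;        // precomposed "ö"
  // The raw comparison looks at code units only, so this prints FALSE.
  Writeln('raw:        ', S1 = S2);
  // After normalizing both sides the strings should match (TRUE),
  // provided the API call succeeds.
  Writeln('normalized: ', ToNFC(S1) = ToNFC(S2));
end.
```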


Kind Regards,
Stefan Mueller 
_______________________
R&D Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com 
 


-----Original Message-----
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On 
Behalf Of John Bird
Sent: Tuesday, November 23, 2010 7:33 PM
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Iterating over a string is for the purpose of doing something with each 
individual character......whether it is an 'A' or an 'A' with a ^ (caret) on 
top of it.   When I said the number of bytes in a character varies I was not 
meaning the number of bytes in a Char - I was meaning the total number of 
bytes in one resulting character or letter might vary.   For instance the 
word fiancee  (with an acute on the last e) has 7 characters, the last of which 
might be 2 code units.

When I iterate over a string I ideally want to get one character in the word 
each time:

could I build a string like this?

setlength(String1,7);
string1[1] := 'f';
string1[2] := 'i';
string1[3] := 'a';
string1[4] := 'n';
string1[5] := 'c';
string1[6] := 'e';
string1[7] := 'e';            //I would want the full e acute here

hence I want to be able to go

    for i :=1 to length(string1) do
    begin
            thisChar:=string1[i];        //get each character one at a time
            listbox1.items.add('i=' + inttostr(i) + '  character at position i = ' + ThisChar);
    end;

I would be expecting to see 7 characters, 7 lines in the list box, and 
length=7,  with the last being e acute.
Now everything Jolyon is saying, and what Cary also implies, is that this is 
not going to work.   This looks to be a real nuisance!

Now I think the e acute could be one unicode character (as there is likely to 
be a representation using one character, one code point and one code unit) or 
one character, two code units, 2*2 bytes - a surrogate pair - where eg one 
supplies the e and one the acute.   So it looks like what I see might vary 
according to how the e acute is encoded in the string?

As I read further this gets murkier, as some of the things Cary Jensen says are 
not the same as what you say even if you say it emphatically!

This is why I am thinking we have to understand clearly Unicode, and the 
Windows implementation of it.....and I don't really yet.

Here is what Cary Jensen says about a similar example with 7 characters, one of 
which is a surrogate pair:

"
Although there are 7 characters in the printed string, the UnicodeString 
contains 8 code units, as returned by the Length function. Inspection of the 
6th and 7th elements of the UnicodeString reveal the high and low surrogate 
values, each of which are code units.
And, though the size of the UnicodeString is 16 bytes, ElementToCharLen 
accurately returns that there were a total of 7 code points in the string.
While these answers suffice for surrogate pairs, unfortunately, things are not 
exactly the same when it comes to composite characters. Specifically, when a 
UnicodeString contains at least one composite character, that composite 
character may occupy two or more code units, though only one actual character 
will appear in the displayed string. Furthermore, ElementToCharLen is designed 
specifically to handle surrogate pairs, and not composite characters.
Actually, composite characters introduce an issue of string normalization, 
which is not currently handled by Delphi's RTL (runtime library). When I asked 
Seppy Bloom about this, he replied that Microsoft has recently added 
normalization APIs (application programming interfaces) to some of the latest 
versions of Windows®, including Windows® Vista, Windows® Server 2008, and 
Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count the 
number of characters in a UnicodeString that includes at least one composite 
character. I am including this code here for your benefit, but I must offer 
these cautions. First, this code has not been thoroughly tested, and has not 
been certified. If you use it, you do so at your own risk. Second, be aware 
that this code will not work on pre-Windows XP installations, and will only 
work with Windows XP if you have installed the Microsoft Internationalized 
Domain Names (IDN) Mitigation APIs 1.1."
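
To make the distinction in the quote concrete, here is a small sketch (this is 
not the code from the paper). It counts code points by skipping the trailing 
half of each surrogate pair; CountCodePoints is a hypothetical helper name. It 
shows that a surrogate pair collapses to one code point, while a composite 
character still counts as two even though it displays as one letter.

```delphi
program CountCodePointsDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper, not an RTL routine: counts code points by ignoring
// low (trailing) surrogates, which are always the second half of a pair.
function CountCodePoints(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  Clef, Umlaut: UnicodeString;
begin
  // U+1D11E (musical G clef) lies outside the BMP, so UTF-16 stores it as
  // the surrogate pair D834 DD1E.
  Clef := 'abc'#$D834#$DD1E;
  // "o" plus combining diaeresis: two code points, one displayed letter.
  Umlaut := 'o'#$0308;

  // prints "5 code units, 4 code points"
  Writeln(Length(Clef), ' code units, ', CountCodePoints(Clef), ' code points');
  // prints "2 code units, 2 code points"
  Writeln(Length(Umlaut), ' code units, ', CountCodePoints(Umlaut), ' code points');
end.
```

Counting what the user perceives as one letter in the second case needs 
normalization or grapheme handling, which is the part the quote says the RTL 
does not provide.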

http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf

Elsewhere he implies that Delphi can handle normalised strings for comparisons 
if one is careful, as in

var
  s1, s2: String;
begin
  ListBox1.Items.Clear;
  s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld';  //make using surrogate pairs
  s2 := 'Hellö Wörld';
  ListBox1.Items.Add(s1);
  ListBox1.Items.Add(s2);
  ListBox1.Items.Add(BoolToStr(s1 = s2, True));
  ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
end;

The contents of ListBox1 are shown in the following figure.

Hellö Wörld
Hellö Wörld
False
True

Now I am not sure if the above example will show properly in email - because 
email text is generally limited to the ASCII characters and lists like this 
usually also restrict to text and not HTML emails.   So as a related 
exercise I am curious whether the above example prints OK on the list......the 
words  hello and world should have umlaut  (..) over each o in case it doesn't 
arrive like that on the list.

John

As I understand it iterating over a string with Chars does get around the 
problem of surrogate pairs

It depends what you mean by “get around the problem”.


_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe


