As I understand it, iterating over a string with Chars does get around the
problem of surrogate pairs, as the character you are currently on might be
1, 2 or more bytes if it contains surrogate pairs, but is just one Unicode
character. So if all you want is to iterate over the characters in the string,
your code should be perfect.
My question is: if you are not using for C in String1 do and want to use
for i:=1 to length(string1) do
what do you use instead of length to get the number of characters in the
string in general? length is not the number of characters, it's the number of
UTF-16 code units (with the two halves of a surrogate pair counted
separately), if I understand correctly.
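One way to approach the counting question is sketched below. It assumes Delphi 2009 or later, where string is a UTF-16 UnicodeString indexed from 1. CodePointCount is a hypothetical helper written for this illustration, not an RTL routine. It counts code points, which is still not always the same as what a user sees as one character (combining marks are counted separately).

```pascal
program CountCodePoints;

{$APPTYPE CONSOLE}

// Hypothetical helper (not part of the RTL): counts Unicode code points in a
// UTF-16 string by counting every element except low (trailing) surrogates.
// Assumes well-formed UTF-16, i.e. every low surrogate follows a high one.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    // $DC00..$DFFF is the low-surrogate range, the second half of a pair
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  S: string;
begin
  // 'A', then U+1D11E written as its surrogate pair $D834 $DD1E, then 'B'
  S := 'A' + #$D834#$DD1E + 'B';
  Writeln(Length(S));          // 4 UTF-16 elements
  Writeln(CodePointCount(S));  // 3 code points
end.
```

For text that stays within the BMP the two numbers are equal, so existing Length-based loops behave as before.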
Separate issue - I understand that if one wants to iterate over the bytes of a
string then one uses byte rather than char, and then one does have to
investigate each byte to see if it is part of a surrogate pair. There look to
be routines for this – however I am guessing most won’t be needing to do this.
Fortunately!
Also, I think getting what we used to call the ASCII value of a character, or
creating a character, still works the same. In fact, for the English alphabet
the codes are the same, I understand? Can someone confirm? (That is, the
character might use 2 bytes in a Unicode string, but the value stored for 'A'
is still 41 hex or 65 decimal.) Which means I think that one can do
var
  code1, code2: integer;
  char1: ansichar;
  char2: char;
...
  char1 := 'A';
  char2 := 'A'; // Unicode char, 2 bytes
  code1 := ord(char1);
  code2 := ord(char2);
In this case I think code1 = code2. Can anyone confirm this? Of course, once
one goes beyond English/Latin (ISO 8859) characters this is no longer going to
be true.
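A small console program can check this directly. It is only a sketch and assumes Delphi 2009 or later, where Char is a 2-byte WideChar; the program name is made up for the example.

```pascal
program OrdCheck;

{$APPTYPE CONSOLE}

var
  Char1: AnsiChar;
  Char2: Char;  // WideChar in Delphi 2009 and later
begin
  Char1 := 'A';
  Char2 := 'A';
  // Same numeric value in both encodings for ASCII characters
  Writeln(Ord(Char1), ' ', Ord(Char2));        // 65 65
  // Only the storage size differs
  Writeln(SizeOf(Char1), ' ', SizeOf(Char2));  // 1 2
end.
```

The values agree because the first 128 Unicode code points are the ASCII characters.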
John
Doh! Thanks Jolyon for clearing that misunderstanding on my part. I was aware
of the surrogate pair issue but I wrongly assumed that this might have been
taken care by the iterator implementation. I guess not.
Thanks again!
Cheers,
Colin
On 23 November 2010 13:06, Jolyon Smith <jsm...@deltics.co.nz> wrote:
Colin, the for C in loop and the for i := 1 to Length() loops are
functionally identical! The only difference is that the “for in” version
incurs the slight overhead of the enumerator framework invoked by the compiler
and runtime magic to support that syntax.
But in neither case will the loop itself help detect/respond to surrogate
pairs (a single “WideChar” is potentially only ½ the data required to form a
complete “character”). The only way to reduce an iterator over a string to a
simple char-wise loop, whether explicit or using enumerators, is to first
convert to UTF32, the facilities for which in the Delphi RTL are <cough>
rudimentary, to put it politely. Non-existent may be nearer the mark.
The precise mechanics of the loop construct used are not material to that
problem.
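To illustrate the point about surrogate pairs, here is a minimal sketch of an index-based loop that detects a pair by hand and combines it into a single code point, without converting the whole string to UTF32 first. It assumes Delphi 2009 or later with 1-based UTF-16 strings and uses only arithmetic on the element values, no RTL helpers. The variable names are illustrative.

```pascal
program WalkCodePoints;

{$APPTYPE CONSOLE}

var
  S: string;
  I: Integer;
  Lead, Trail, CodePoint: Cardinal;
begin
  // 'A', then U+1D11E written as its surrogate pair $D834 $DD1E, then 'B'
  S := 'A' + #$D834#$DD1E + 'B';
  I := 1;
  while I <= Length(S) do
  begin
    Lead := Ord(S[I]);
    // $D800..$DBFF marks a high (leading) surrogate
    if (Lead >= $D800) and (Lead <= $DBFF) and (I < Length(S)) then
    begin
      Trail := Ord(S[I + 1]);
      // $DC00..$DFFF marks a low (trailing) surrogate
      if (Trail >= $DC00) and (Trail <= $DFFF) then
      begin
        // Combine the two halves into one code point above $FFFF
        CodePoint := $10000 + ((Lead - $D800) shl 10) + (Trail - $DC00);
        Writeln(CodePoint);
        Inc(I, 2);  // consume both elements of the pair
        Continue;
      end;
    end;
    // BMP character (or an unpaired surrogate, passed through unchanged)
    Writeln(Lead);
    Inc(I);
  end;
end.
```

For this input the loop prints 65, 119070 and 66, that is, three code points from four UTF-16 elements.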
However, just as before Unicode when most people didn’t care and just wrote
code that assumed ANSI==ASCII, these days people won’t care and will write code
that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate
pairs just as they used to ignore extended ASCII and ANSI characters.
And for most people, that will probably actually work.
J
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On
Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 14:31
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
I won't answer everything but just on this one question:
On 23 November 2010 11:04, John Bird <johnkb...@paradise.net.nz> wrote:
Extra question:
It looks like code like
for i:=1 to length(string1) do
begin
  DoSomethingWithOneChar(string1[i]);
end;
cannot be used reliably. The problem is that length(string1) looks like
it cannot be safely used, as a Unicode character may take 2 code units,
and length(string1) highlights that there is a difference between the number
of Unicode characters in a string and the number of code units. Still
figuring out what the best practice is here, as I have quite a lot of string
routines. Should be OK as long as the Unicode text actually is ASCII.
You can use something like this:
var
  C: Char;
...
for C in String1 do
begin
  DoSomethingWithOneChar(C);
end;
In this case you don't need to know the index of each character, you just get
the char using the for..in..do loop.
_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject:
unsubscribe