Re: [DUG] Upgrading to XE - Unicode strings questions

John Bird Mon, 22 Nov 2010 19:13:59 -0800

As I understand it, iterating over a string with Chars does get around the 
problem of surrogate pairs, as any character you are currently on might be 
1, 2 or more bytes if it contains surrogate pairs, but is just one Unicode 
character.   So if one is after iterating over the characters in the string, 
your code should be perfect.


My question is: if you are not using   for C in String1 do   and want to use   
for i:=1 to length(string1) do

what do you use instead of length to get the number of characters in the 
string in general?  length is not the number of characters, it's the number of 
UTF-16 code units (with each half of a surrogate pair counted separately), if I 
understand correctly.
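
One way to answer the counting question, as a minimal sketch rather than an RTL facility: count every Char except low (trailing) surrogates, so that each surrogate pair is counted once. The name CodePointCount is hypothetical, and the sketch assumes a Unicode Delphi (2009 or later) where string is a UTF-16 UnicodeString. It counts code points, which is still not the same as user-perceived characters when combining marks are involved.

```pascal
program CountCodePoints;
{$APPTYPE CONSOLE}

// Hypothetical helper (not part of the RTL): counts Unicode code points
// by skipping the trailing half of each surrogate pair.
function CodePointCount(const S: string): Integer;
var
  C: Char;
begin
  Result := 0;
  for C in S do
    // $DC00..$DFFF is the low (trailing) surrogate range
    if (Ord(C) < $DC00) or (Ord(C) > $DFFF) then
      Inc(Result);
end;

var
  S: string;
begin
  // 'A' followed by U+1D538, which UTF-16 stores as the pair $D835 $DD38
  S := 'A'#$D835#$DD38;
  Writeln(Length(S));          // 3 code units
  Writeln(CodePointCount(S));  // 2 code points
end.
```

For text that contains no surrogate pairs, the two numbers are equal and Length can be used as before.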

Separate issue - I understand that if one wants to iterate over the bytes of a 
string then one uses byte rather than char, and then one does have to 
investigate each byte to see if it is part of a surrogate pair.  There look to 
be routines for this – however I am guessing most won’t be needing to do this. 
Fortunately!


Also, I think getting what we used to call the ASCII value of a character, or 
creating a character, still works the same. In fact for the English alphabet the 
codes are the same, I understand?  Can someone confirm?   (i.e. the character 
might use 2 bytes if encoded as a Unicode string, but the value stored for ‘A’ is 
still 41 hex or 65 decimal.)   Which means I think that one can do


var
  code1, code2: Integer;
  char1: AnsiChar;
  char2: Char;

    char1 := 'A';
    char2 := 'A';            // Unicode char, 2 bytes
    code1 := Ord(char1);     // numeric value of the 1-byte AnsiChar
    code2 := Ord(char2);     // numeric value of the 2-byte Char

In this case I think code1=code2?  Can anyone confirm this?   Of course once one 
goes away from English/Latin (ISO 8859) characters this is no longer going to be true.



John
 
Doh! Thanks Jolyon for clearing that misunderstanding on my part. I was aware 
of the surrogate pair issue but I wrongly assumed that this might have been 
taken care by the iterator implementation. I guess not. 

Thanks again!
Cheers,
Colin

On 23 November 2010 13:06, Jolyon Smith <[email protected]> wrote:

  Colin, the for C in loop and the for i := 1 to Length() loops are 
functionally identical!  The only difference is that the “for in” version 
incurs the slight overhead of the enumerator framework invoked by the compiler 
and runtime magic to support that syntax.



  But in neither case will the loop itself help detect/respond to surrogate 
pairs (a single “WideChar” is potentially only ½ the data required to form a 
complete “character”).  The only way to reduce an iterator over a string to a 
simple char-wise loop, whether explicit or using enumerators, is to first 
convert to UTF32, the facilities for which in the Delphi RTL are <cough> 
rudimentary, to put it politely.  Non-existent may be nearer the mark.
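
To make the UTF-32 idea concrete, here is a minimal hand-rolled sketch that walks a string one code point at a time, combining surrogate pairs as it goes. NextCodePoint is a hypothetical helper written for illustration, not an RTL routine, and it assumes 1-based UTF-16 strings as in Delphi 2009 through XE. An unpaired surrogate is simply returned as-is.

```pascal
program WalkCodePoints;
{$APPTYPE CONSOLE}

// Hypothetical helper: returns the code point that starts at S[Index]
// and advances Index past it (by 1 or 2 UTF-16 code units).
function NextCodePoint(const S: string; var Index: Integer): Cardinal;
var
  Hi, Lo: Cardinal;
begin
  Hi := Ord(S[Index]);
  Inc(Index);
  Result := Hi;
  // A high surrogate followed by a low surrogate forms one code point
  if (Hi >= $D800) and (Hi <= $DBFF) and (Index <= Length(S)) then
  begin
    Lo := Ord(S[Index]);
    if (Lo >= $DC00) and (Lo <= $DFFF) then
    begin
      Result := $10000 + ((Hi - $D800) shl 10) + (Lo - $DC00);
      Inc(Index);
    end;
  end;
end;

var
  S: string;
  I: Integer;
begin
  S := 'A'#$D835#$DD38;   // 'A' followed by U+1D538 as a surrogate pair
  I := 1;
  while I <= Length(S) do
    Writeln(NextCodePoint(S, I));   // prints 65, then 120120 ($1D538)
end.
```

The loop still uses Length, but only as the bound on code units. The helper decides how many units each step consumes.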



  The precise mechanics of the loop construct used is not material to that 
problem.





  However, just as before Unicode when most people didn’t care and just wrote 
code that assumed ANSI==ASCII, these days people won’t care and will write code 
that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate 
pairs just as they used to ignore extended ASCII and ANSI characters.



  And for most people, that will probably actually work.
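
If code does rely on the Unicode==BMP assumption, a cheap guard can at least detect input that breaks it. A minimal sketch follows. IsBmpOnly is a hypothetical name, not an RTL function, and it only checks that the string contains no surrogate code units.

```pascal
program BmpCheck;
{$APPTYPE CONSOLE}

// Hypothetical helper: True when S contains no surrogate code units,
// i.e. every Char is a complete BMP code point on its own.
function IsBmpOnly(const S: string): Boolean;
var
  C: Char;
begin
  Result := True;
  for C in S do
    // $D800..$DFFF covers both high and low surrogates
    if (Ord(C) >= $D800) and (Ord(C) <= $DFFF) then
    begin
      Result := False;
      Exit;
    end;
end;

begin
  Writeln(IsBmpOnly('Hello'));           // TRUE
  Writeln(IsBmpOnly('A'#$D835#$DD38));   // FALSE
end.
```

A routine that indexes with for i := 1 to Length() could check this on entry and fall back to a slower, pair-aware path only when it returns False.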



  J





  From: [email protected] [mailto:[email protected]] On 
Behalf Of Colin Johnsun
  Sent: Tuesday, 23 November 2010 14:31
  To: NZ Borland Developers Group - Delphi List


  Subject: Re: [DUG] Upgrading to XE - Unicode strings questions


  I won't answer everything but just on this one question:

  On 23 November 2010 11:04, John Bird <[email protected]> wrote:

  Extra question:

  It looks like code like

     for i:=1 to length(string1) do
     begin
             DoSomethingWithOneChar(string1[i]);
     end;

  cannot be used reliably.   The problems are that length(string1) looks like
  it cannot be safely used - as unicode characters may include 2 codepoints
  and length(string1) highlights that there is a difference between the number
  of unicode characters in a string and the number of codepoints.   Still
  figuring out what is the best practice here, as I have quite a lot of string
  routines.   Should be OK as long as the unicode text actually is ASCII.





  you can use something like this:



  var
    C: Char;
  ...
    for C in String1 do
    begin
      DoSomethingWithOneChar(C);
    end;



  In this case you don't need to know the index of each character, you just get 
the char using the for..in..do loop.








  _______________________________________________
  NZ Borland Developers Group - Delphi mailing list
  Post: [email protected]
  Admin: http://delphi.org.nz/mailman/listinfo/delphi
  Unsubscribe: send an email to [email protected] with Subject: 
unsubscribe




