Re: [DUG] Upgrading to XE - Unicode strings questions

Stefan Mueller Mon, 22 Nov 2010 19:04:51 -0800

You are absolutely right: if you need to know the real number of
characters then UTF-32 is the way to go. I use the JEDI library for some
advanced things; it has a Unicode library that supports UTF-32/UCS-4
properly, together with helper functions that actually work correctly for
changing things like uppercase/lowercase on those characters.

But for most people the scripts/languages supported in the Basic
Multilingual Plane (plane 0, i.e. the characters that fit into the first
64k range and hence have no problem being represented as UTF-16/UCS-2)
will do just fine. Code points above the 64k range don't really occur in
the real world; they are special cases, and for most applications it isn't
worth the trouble/effort to handle them.

Kind Regards,
Stefan Mueller 
_______________________
R&D Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com

From: [email protected] [mailto:[email protected]] On
Behalf Of Jolyon Smith
Sent: Tuesday, November 23, 2010 11:07 AM
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Colin, the for C in loop and the for i := 1 to Length() loops are
functionally identical!  The only difference is that the for in version
incurs the slight overhead of the enumerator framework invoked by the
compiler and runtime magic to support that syntax.

But in neither case will the loop itself help detect/respond to surrogate
pairs (a single WideChar is potentially only ½ the data required to form a
complete character).  The only way to reduce an iterator over a string to
a simple char-wise loop, whether explicit or using enumerators, is to first
convert to UTF32, the facilities for which in the Delphi RTL are <cough>
rudimentary, to put it politely.  Non-existent may be nearer the mark.
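
For anyone who does need to respect surrogate pairs without converting the
whole string first, here is a minimal sketch of an index-based walk that
decodes one code point at a time. NextCodePoint is a hypothetical helper
written for this illustration, not an RTL routine, and the sketch assumes
Delphi 2009 or later, where string is UnicodeString holding UTF-16. (The
System unit does offer UnicodeStringToUCS4String for a whole-string
conversion, if that suits your case better.)

```pascal
program SurrogateWalk;
{$APPTYPE CONSOLE}

// Hypothetical helper: decodes the code point that starts at S[Index]
// and reports how many UTF-16 code units (1 or 2) it occupies.
// An unpaired surrogate is returned unchanged with Units = 1.
function NextCodePoint(const S: UnicodeString; Index: Integer;
  out Units: Integer): Cardinal;
var
  HiUnit, LoUnit: Cardinal;
begin
  HiUnit := Ord(S[Index]);
  Units := 1;
  Result := HiUnit;
  // A high surrogate only counts if a low surrogate follows it.
  if (HiUnit >= $D800) and (HiUnit <= $DBFF) and (Index < Length(S)) then
  begin
    LoUnit := Ord(S[Index + 1]);
    if (LoUnit >= $DC00) and (LoUnit <= $DFFF) then
    begin
      Result := $10000 + ((HiUnit - $D800) shl 10) + (LoUnit - $DC00);
      Units := 2;
    end;
  end;
end;

var
  S: UnicodeString;
  I, Units: Integer;
  CP: Cardinal;
begin
  // 'a', then U+1D11E (MUSICAL SYMBOL G CLEF) as a surrogate pair, then 'b'
  S := 'a' + #$D834#$DD1E + 'b';
  I := 1;
  while I <= Length(S) do
  begin
    CP := NextCodePoint(S, I, Units);
    Writeln('code point ', CP, ' uses ', Units, ' code unit(s)');
    Inc(I, Units); // step over both halves of a pair
  end;
end.
```

Note that this has to be a while loop with a manual increment; neither the
for..in form nor the for i := 1 to Length() form can skip the second half
of a pair on its own.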

The precise mechanics of the loop construct used are not material to that
problem.

However, just as before Unicode when most people didn't care and just wrote
code that assumed ANSI==ASCII, these days people won't care and will write
code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring
surrogate pairs just as they used to ignore extended ASCII and ANSI
characters.

And for most people, that will probably actually work.

J

From: [email protected] [mailto:[email protected]] On
Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 14:31
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

I won't answer everything but just on this one question:

On 23 November 2010 11:04, John Bird <[email protected]> wrote:

Extra question:

It looks like code like

   for i := 1 to Length(string1) do
   begin
     DoSomethingWithOneChar(string1[i]);
   end;

cannot be used reliably.   The problem is that Length(string1) looks like
it cannot be safely used, as a Unicode character may take 2 UTF-16 code
units, and Length(string1) highlights that there is a difference between the
number of Unicode characters in a string and the number of code units.   Still
figuring out what is the best practice here, as I have quite a lot of string
routines.   Should be OK as long as the Unicode text actually is ASCII.
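
To make the gap between Length() and the number of characters concrete,
here is a small sketch that counts code points by skipping the second half
of each surrogate pair. CodePointCount is a hypothetical helper for
illustration only; it assumes Delphi 2009 or later and a well-formed UTF-16
string, and it counts code points, not user-perceived characters (combining
marks are still counted separately).

```pascal
program CountCodePoints;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts code points by ignoring low surrogates,
// i.e. the second code unit of every surrogate pair.
// Assumes S is well-formed UTF-16.
function CodePointCount(const S: UnicodeString): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  S: UnicodeString;
begin
  // 'a', then U+1D11E as a surrogate pair, then 'b'
  S := 'a' + #$D834#$DD1E + 'b';
  Writeln('Length (UTF-16 code units): ', Length(S));
  Writeln('Code points: ', CodePointCount(S));
end.
```

For this sample string Length() reports 4 while the helper reports 3; for
text that stays inside the BMP the two numbers are the same.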

you can use something like this:

var
  C: Char;
...
  for C in String1 do
  begin
    DoSomethingWithOneChar(C);
  end;

In this case you don't need to know the index of each character, you just
get the char using the for..in..do loop.

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: [email protected]
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to [email protected] with Subject: 
unsubscribe
