Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread John Bird

Iterating over a string is for the purpose of doing something with each 
individual character, whether it is an 'A' or an 'A' with a ^ (caret) on 
top of it.   When I said the number of bytes in a character varies I did not 
mean the number of bytes in a Char; I meant that the total number of 
bytes in one resulting character or letter might vary.   For instance the 
word fiancee  (with an acute on the last e) has 7 characters, the last of 
which might be 2 code units.

When I iterate over a string I ideally want to get one character in the word 
each time:

could I build a string like this?

setlength(String1,7);
string1[1] := 'f';
string1[2] := 'i';
string1[3] := 'a';
string1[4] := 'n';
string1[5] := 'c';
string1[6] := 'e';
string1[7] := 'e';//I would want the full e acute here

hence I want to be able to go

for i :=1 to length(string1) do
begin
thisChar:=string1[i];//get each character one at a time
listbox1.items.add('i=' + inttostr(i) + '  character at position i = ' + ThisChar);
end;

I would be expecting to see 7 characters, 7 lines in the list box, and 
length=7,  with the last being e acute.
Now everything Jolyon is saying, and Cary too, implies that this is not 
going to work.   This looks to be a real nuisance!

Now I think the e acute could be one unicode character (as there is likely 
to be a representation using one character, one code point and one code 
unit) or as one character, two code units, 2*2 bytes - a surrogate pair - 
where eg one supplies the e and one the acute.   So it looks like what I see 
might vary according to how the e acute is encoded in the string?

As I read further this gets murkier, as some of the things Cary Jensen says 
are not the same as what you say even if you say it emphatically!

This is why I am thinking we have to understand Unicode clearly, and the 
Windows implementation of it... and I don't really yet.

Here is what Cary Jensen says about a similar example with 7 characters, one 
of which is a surrogate pair:


Although there are 7 characters in the printed string, the UnicodeString 
contains 8 code units, as returned by the Length function. Inspection of the 
6th and 7th elements of the UnicodeString reveal the high and low surrogate 
values, each of which are code units. And, though the size of the 
UnicodeString is 16 bytes, ElementToCharLen accurately returns that there 
were a total of 7 code points in the string.

While these answers suffice for surrogate pairs, unfortunately, things are 
not exactly the same when it comes to composite characters. Specifically, 
when a UnicodeString contains at least one composite character, that 
composite character may occupy two or more code units, though only one 
actual character will appear in the displayed string. Furthermore, 
ElementToCharLen is designed specifically to handle surrogate pairs, and not 
composite characters.

Actually, composite characters introduce an issue of string normalization, 
which is not currently handled by Delphi's RTL (runtime library). When I 
asked Seppy Bloom about this, he replied that Microsoft has recently added 
normalization APIs (application programming interfaces) to some of the 
latest versions of Windows®, including Windows® Vista, Windows® Server 2008, 
and Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count the 
number of characters in a UnicodeString that includes at least one composite 
character. I am including this code here for your benefit, but I must offer 
these cautions. First, this code has not been thoroughly tested, and has not 
been certified. If you use it, you do so at your own risk. Second, be aware 
that this code will not work on pre-Windows XP installations, and will only 
work with Windows XP if you have installed the Microsoft Internationalized 
Domain Names (IDN) Mitigation APIs 1.1.

http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf

Elsewhere he implies that Delphi can handle normalised strings for 
comparisons if one is careful, as in

var
s1, s2: String;
begin
ListBox1.Items.Clear;
s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld'; //make using surrogate pairs
s2 := 'Hellö Wörld';
ListBox1.Items.Add(s1);
ListBox1.Items.Add(s2);
ListBox1.Items.Add(BoolToStr(s1 = s2, True));
ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
The contents of ListBox1 are shown in the following figure.

Hellö Wörld
Hellö Wörld
False
True

Now I am not sure if the above example will show properly in email - because 
email text is generally limited to the ASCII characters and lists like this 
usually also restrict to text and not HTML emails.   So as a related 
exercise I am curious whether the above example prints OK on the 
list. The words hello and world should have an umlaut (..) over each o, in 
case it doesn't arrive like that on the list.

John

As I understand it iterating over a string with Chars

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Stefan Mueller

John,

I think you are confusing Canonical/Normalized versions of the same Unicode 
string (in the example s1 is canonical, s2 is normalized) and the effect of 
local codepage conversion.

Windows-1252 codepage (latin ISO 8859-1) has support for characters like the 
ö (code #246) and é (code #233). Converting to 
ansistring/ansichar on your system will take care of canonical Unicode 
representation and hence return true if you compare those strings. Please note 
that this only works because your system is set to a latin based codepage ... 
do the same on a Japanese version of windows and you'll get a very different 
result as there is no support for ö in ansistring under Japanese codepage! 
Because your system is Latin, your first testcase/example of building the 
word fiancee should actually work without problems - Jolyon/Cary are probably 
wrong if they indeed implied that this wouldn't work.

The ö can be written as a compound #$006F + #$0308 in canonical format ... 
and as #$00f6 in the normalized format. For most normal applications it just 
doesn't really matter either way because a user that is inputting text under 
his local codepage will always do it the same way and hence chances of you 
encountering a mix between canonical/normalized version will be close to zero. 
You only ever get issues if you cross codepage boundaries (like for example if 
you have users in different countries storing data in a database - which is why 
international databases often use UTF-8 to store data instead of their native 
charactersets). Most of the better databases (like for example Oracle) have 
built in support for sorting and handling canonical format and do the 
conversion automatically for you  ... for someone writing desktop applications 
it usually just isn't an issue either way. 


Kind Regards,
Stefan Mueller 
___
RD Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com 
 



Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Jolyon Smith

> I think you are confusing Canonical/Normalized versions 
> of the same Unicode string (in the example s1 is canonical, 
> s2 is normalized) and the effect of local codepage conversion.

Yep, and for the record I think this is a big problem with the way Embarcadero 
implemented Unicode.

By pursuing the "Unicode is a no-brainer" approach (facilitating easy migration 
for ASCII apps) they have obfuscated the fact that Unicode is far from simple.  
Or at least doing it right is.

Danny Thorpe opined years ago that it made a lot of sense to do 64-bit and 
Unicode in one go as a big-bang breaking change, leaving the 32-bit, ANSI VCL 
product behind as a legacy platform.  Danny Thorpe always was a clever guy!  ;)


 
> The ö can be written as a compound #$006F + #$0308 in 
> canonical format ... and as #$00f6 in the normalized 
> format. For most normal applications it just doesn't really 
> matter either way because a user that is inputting text under 
> his local codepage will always do it the same way

A user could specifically choose to enter that character in either form - this 
is unlikely, yes.  Or, two users using the same codepage could choose to enter 
the character differently.

Or if your data is coming from two separate external sources.

The *only* way to be sure is to normalise before processing.


> You only ever get issues if you cross codepage boundaries 
> (like for example if you have users in different countries 
> storing data in a database - which is why international 
> databases often use UTF-8 to store data instead of their 
> native charactersets).

This makes no sense at all to me.

ö can be encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8.  Whether you 
encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint and 
a character followed by a diacritic are still two distinct character sequences.


___
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Jolyon Smith

John, the problem is that in Unicode "single character" is meaningless unless 
you have performed some pre-processing to GIVE that term some meaning.  There 
are some standard forms for such processing, called Normalisations.

The problem is that a "single character" to your eyes, e.g. an accented a, 
could be represented in a Unicode string in at least two ways:

  1.  A single codepoint representing that accented a

  2.  TWO codepoints - the first representing a and the
  second a diacritic codepoint for the accent


> Iterating over a string is for the purpose of doing something with each
> individual character

That's fine, but in Unicode what you have is a string not of characters but of 
codepoints.  The concept of a character is not synonymous with codepoint in 
Unicode in the same way that it is with ASCII or even ANSI.

So you have compounded complications:

a.  Depending on encoding, a single codepoint (32-bit value) 
 may be encoded in 1, 2, or more bytes.  Each byte may 
 represent a whole codepoint or only part of a codepoint 
 encoding.

b.  Each codepoint may represent a whole character or only 
 PART of a character encoding.


Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY 
codepoint.  That is hugely wasteful in terms of memory/storage for most 
applications.  UTF-16 - the encoding used by Delphi and indeed by Windows 
natively itself - is a compromise.  It is less efficient than ANSI for ASCII, 
but more efficient than UTF-32 for ANSI character sets represented in the BMP.

For applications working entirely in the BMP UTF-16 is also relatively easy to 
process - for NORMALISED strings, each codepoint IS a character (in the BMP).  
But for non-normalised data that is still not necessarily the case.



> could I build a string like this?
>
> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e';//I would want the full e acute here

Yes, you can.

But you might also *receive* from another source, a string that is apparently 
the same at the visual representation level, but different at the data level, 
where:

 string1[1] = 'f';
 string1[2] = 'i';
 string1[3] = 'a';
 string1[4] = 'n';
 string1[5] = 'c';
 string1[6] = 'e';
 string1[7] = 'e';// Normal 'e' character, i.e. identical to string1[6]
 string1[8] = U+0301; // Combining acute diacritic

When displayed on screen this string will appear identical to your string, but 
it is represented in the data in a different way.


> hence I want to be able to go
>
> for i :=1 to length(string1) do
> begin
>  ..
> end
>
> Now everything Jolyon is saying, and Cary too, implies that this is
> not going to work.   This looks to be a real nuisance!

I don't know what gave you that impression from what I said.

Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more 
work than people think - but what you want to do here can be done.



> Now I think the e acute could be one unicode character (as there is likely 
> to be a representation using one character, one code point and one code 
> unit) or as one character, two code units, 2*2 bytes - a surrogate pair - 
> where eg one supplies the e and one the acute.   

NO!!!  This is NOT what a surrogate pair is.

A surrogate pair is encountered ONLY in UTF-16, and is found when you have a 
codepoint that is not in the BMP.  i.e. a value > 65535 that cannot be encoded 
in a 16-bit value.  These are typically CJKV characters 
(Chinese/Japanese/Korean/Vietnamese) sometimes called Han or Kanji character 
sets.

The first 16-bit value indicates a page in the non-BMP.  The following 16-bit 
value then identifies an entry in that page.  To obtain the codepoint that 
the PAIR of VALUES represents, you have to apply a transform, combining the 
page selector with the page entry.  But what you get is a single codepoint.  
(you don't have to do this - there are routines to do it for you, but you have 
to invoke them as appropriate).

A Surrogate Pair is a representation of a single codepoint, NOT a relationship 
between TWO codepoints.



When you have a visual character encoded as a codepoint + a following, 
combining codepoint, that is simply TWO Unicode codepoints that are combined to 
form one VISUAL character.  That is NOT a surrogate pair however.  It is 
merely two codepoints that have to be combined.
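The difference between the two cases can be seen from Length and the surrogate tests in the Character unit. This is only a minimal sketch, assuming Delphi 2009 or later (where string is UTF-16 and TCharacter exists); the variable names are made up for illustration, and U+1D11E is just one arbitrary codepoint outside the BMP.

```pascal
program CodeUnitsDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

var
  Precomposed, Decomposed, NonBmp: string;
begin
  // U+00E9: e acute as ONE codepoint, stored in ONE UTF-16 code unit
  Precomposed := Char($00E9);
  // 'e' followed by U+0301 (combining acute): TWO codepoints, TWO code units
  Decomposed := 'e' + Char($0301);
  // U+1D11E lies outside the BMP: ONE codepoint stored as a surrogate pair
  NonBmp := Char($D834) + Char($DD1E);

  Writeln('Precomposed length: ', Length(Precomposed));
  Writeln('Decomposed length:  ', Length(Decomposed));
  Writeln('Non-BMP length:     ', Length(NonBmp));

  // Only the non-BMP string contains surrogate code units
  Writeln('NonBmp[1] is a high surrogate: ', TCharacter.IsHighSurrogate(NonBmp[1]));
  Writeln('NonBmp[2] is a low surrogate:  ', TCharacter.IsLowSurrogate(NonBmp[2]));
  // The combining acute is an ordinary BMP codepoint, not a surrogate
  Writeln('Decomposed[2] is a surrogate:  ', TCharacter.IsSurrogate(Decomposed[2]));
end.
```

Both Decomposed and NonBmp report a Length of 2, but for different reasons: one holds two codepoints, the other holds one codepoint split across two code units.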


___
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Todd

Hi John

You can find out whether a unicode string is inside the BMP by 
converting it to UTF-32 and checking that the new string is twice the 
size in bytes of the original (UTF-16) string.
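The same check can also be made without any conversion, because a UTF-16 string stays inside the BMP exactly when it contains no surrogate code units. A rough sketch, assuming Delphi 2009 or later; IsBmpOnly is a hypothetical helper name, not an RTL routine.

```pascal
program BmpCheckDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// Hypothetical helper: True when S contains no surrogate code units,
// i.e. every codepoint in S is inside the BMP.
function IsBmpOnly(const S: string): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to Length(S) do
    if TCharacter.IsSurrogate(S[I]) then
    begin
      Result := False;
      Exit;
    end;
end;

begin
  // Plain BMP text, including a combining diaeresis (U+0308)
  Writeln(IsBmpOnly('Hello' + Char($0308)));
  // U+1D11E encoded as a surrogate pair, so not BMP-only
  Writeln(IsBmpOnly('x' + Char($D834) + Char($DD1E)));
end.
```

Note that this says nothing about composite characters: a string can be entirely inside the BMP and still contain combining sequences.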

> A user could specifically choose to enter that character in either form - 
> this is unlikely, yes.  Or, two users using the same codepage could choose to 
> enter the character differently.
>
> Or if your data is coming from two separate external sources.
>
> The *only* way to be sure is to normalise before processing.

Agreed. That will eliminate any issues with composite codepoints.

>> You only ever get issues if you cross codepage boundaries
>> (like for example if you have users in different countries
>> storing data in a database - which is why international
>> databases often use UTF-8 to store data instead of their
>> native charactersets).
>
> This makes no sense at all to me.
>
> ö encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8.  Whether you 
> encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint 
> vs a character followed by a diacritic are still two distinct character 
> sequences.

True. I think the point is that UTF-8 is the most compact format without 
data loss, regardless of whether the codepoints are composite or not.

Todd.


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread John Bird

I read in one of the references that UTF-32 was a more common standard on 
Unix systems - which I guess means they have chosen the simplest format at 
the cost of using more space?

I think linux/Windows/MacOS use UTF-16 more commonly...

Anyway for the time being, as long as the data in strings is unicode, but is 
still Latin 8859 (ie ASCII characters) I can without worrying too much 
iterate over a string one character at a time...using length.

That was the main thing I wanted to know

John


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Jolyon Smith

> Anyway for the time being, as long as the data in 
> strings is unicode, but is still Latin 8859 (ie ASCII 
> characters) I can without worrying too much iterate over 
> a string one character at a time...using length.

Yep.  But you are building an app that now supports Unicode.

If your users are able to enter data into your app, your app will now
*potentially* find itself handling Unicode data for which it was not
designed, unless you take additional steps to now prevent a user from
entering non-ASCII data in the first place.

Previously you may not have taken these steps so theoretically could have
found a user entering non-ASCII, ANSI characters too, except that in the
past you would not have been using Unicode support as an advertised (or
even unadvertised) feature of your app and could legitimately have told such
users not to be so dumb (in not so many words, of course :D)


This again is the danger of the "no brainer" approach with the Unicode
migration in Delphi.

By selling the idea that switching to Unicode was easy, they have just made
it more confusing in many cases, imho.


"If I can just recompile and patch up a few warnings with some boilerplate,
how come there's all this other stuff that I need to do too?  I thought
Unicode was supposed to make supporting this stuff easier."

Answer: It does.  It makes supporting Unicode easier, but supporting Unicode
is not, itself, easy.

imho


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Ross Levis

It's a shame UTF-8 wasn't made the standard in Delphi.  It's commonly used in 
audio file tags, for example, which I have to deal with.

My software needs to search for songs with specific artists or titles, and it 
sounds like I'm going to have problems where the information is visually the 
same but entered differently in different parts of the world, using all sorts 
of 3rd party software.

Ross.
 

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-23 Thread Jolyon Smith

You should be fine - you just have to ensure you normalise the strings.

You're going to have to convert from UTF-8 to UTF-16 to bring them in to your 
Delphi app anyway, for processing, so you may as well normalise them in the 
process.


UTF-16 was chosen in Delphi because it is also the native encoding in Windows 
itself.
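As a rough illustration of normalising on the way in, the sketch below calls the Windows NormalizeString API (exported from Normaliz.dll on Windows Vista and later, as mentioned earlier in the thread). It assumes Delphi 2009 or later; WinNormalizeString and ToNFC are hypothetical names declared here, not RTL routines, and the code has not been tested against real tag data.

```pascal
program NormaliseDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NormalizationC = 1; // Win32 NORM_FORM value for normalisation form C

// Local import of the Win32 NormalizeString function under a hypothetical name
function WinNormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall;
  external 'Normaliz.dll' name 'NormalizeString';

// Hypothetical helper: returns S in normalisation form C (precomposed)
function ToNFC(const S: string): string;
var
  Needed, Written: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // With no destination buffer the API returns an estimated buffer length
  Needed := WinNormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Needed <= 0 then
    RaiseLastOSError;
  SetLength(Result, Needed);
  Written := WinNormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Needed);
  if Written <= 0 then
    RaiseLastOSError;
  // Trim the estimate down to the number of code units actually written
  SetLength(Result, Written);
end;

var
  s1, s2: string;
begin
  s1 := 'Hello' + Char($0308) + ' World'; // 'o' followed by combining diaeresis
  s2 := 'Hell' + Char($00F6) + ' World';  // precomposed o with diaeresis
  Writeln('Raw comparison:        ', s1 = s2);
  Writeln('Normalised comparison: ', ToNFC(s1) = ToNFC(s2));
end.
```

The raw comparison is a plain code unit comparison, so the two spellings differ; after both sides are brought to the same normalisation form they should compare equal, which is the property a search over artist and title tags needs.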




Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread John Bird

Thanks for the references, so I can answer most of the questions now. 
Here is what I understand so far, if anyone has anything to add this will be 
useful!

Extra question:

It looks like code like

for i:=1 to length(string1) do
begin
DoSomethingWithOneChar(string1[i]);
end;

cannot be used reliably.   The problem is that length(string1) looks like 
it cannot be safely used - as unicode characters may include 2 codepoints 
and length(string1) highlights that there is a difference between the number 
of unicode characters in a string and the number of codepoints.   Still 
figuring out what is the best practice here, as I have quite a lot of string 
routines.   Should be OK as long as the unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot
be read by earlier Delphi, eg D2007 any more?

Answer - From what I have read this is an option?  Yes, the files are not 
portable if unicode.

Q3 – I do a lot of reading ascii data files, and writing back.   Using
mainly TFilestream and stringlists.   Does this in general mean I will need
to use file variables declared as Ansichar and AnsiString instead of Char
and String?
(I would prefer to use the standard VCL where possible)

If I have variables
as1:Ansistring;
s2:string;

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

(otherwise how do I do this?)

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.
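For Q4 and Q5, a small sketch of both directions with explicit casts, which is one common way to make the conversion visible in the code and avoid the implicit string cast warnings. It assumes Delphi 2009 or later and reuses the variable names as1 and s2 from the questions above.

```pascal
program AnsiConvertDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  as1: AnsiString;
  s2: string;
begin
  as1 := 'plain ASCII text';

  // AnsiString -> UnicodeString: converted using the string's code page
  s2 := string(as1);

  // UnicodeString -> AnsiString: characters that do not exist in the
  // ANSI code page cannot be represented and are lost in the conversion
  as1 := AnsiString(s2);

  Writeln(Length(s2), ' ', Length(as1));
end.
```

For text that really is ASCII the round trip is lossless and both lengths are the same; the cost is the conversion itself, hence the advice about not doing it in a tight loop.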

Q6 – I understand any code like

char1:=string1[i];
if char1 in ['a'..'z'] then
begin
message:=string1[i]+' - character is lowercase';
end;

will break. Ansi characters are ordinals below 256, so set comparisons
such as ['a'..'z'] or ['a','b','c'] can be used on them, but this set
code cannot be used for unicode characters.   What is the replacement?

Answer - There is CharInSet call and numerous extra housekeeping functions 
added in TCharacter.
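A short sketch of the two replacements named in the answer, assuming Delphi 2009 to XE unit names (SysUtils for CharInSet, Character for TCharacter). The sample string is made up for illustration.

```pascal
program LowerCaseDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

var
  S: string;
  I: Integer;
begin
  // 'a', 'Z', and U+00E9 (e acute, a lowercase letter outside ASCII)
  S := 'aZ' + Char($00E9);
  for I := 1 to Length(S) do
  begin
    // CharInSet keeps the old set syntax but only matches characters below 256
    if CharInSet(S[I], ['a'..'z']) then
      Writeln(I, ': lowercase ASCII letter');
    // TCharacter.IsLower uses Unicode character properties instead of a set
    if TCharacter.IsLower(S[I]) then
      Writeln(I, ': lowercase letter according to Unicode');
  end;
end.
```

CharInSet is the drop-in fix when the code only cares about a fixed ASCII range; TCharacter is the better fit when "lowercase" should include accented and non-Latin letters.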

Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9
means tab?
if I have code like (logline string1 string2 are string)

logline:=FormatDateTime('dd-mmm- hh:nn:ss',now) + string1 +
#13#10+#9 + string2;
ShowMessage(logline);
Button1.hint:=logline;
writeln(f,logline);

these work D5-D2007   - ie a 2 line messagebox text, 2 line hint,
and 2 lines written to a log file.
is this still going to work?

do carriage returns/tabs/other control characters have to be defined
differently, eg as constants?

Answer - not figured out yet - anyone else know?

Q8 – stringlist1.loadfromfile('Test1.txt');
what happens if this file is ascii text being read into a stringlist
which is unicode strings.

Answer - Default is ANSI text for loadfromfile and savetofile, use the 
overloaded routines that take an encoding for Unicode

Q9 -   stringlist1.savetofile('Test1.txt')
 presumably this is no longer ascii text.   How do I save and read a
stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist
type (for ansistrings) as well as a unicode TStringlist type?
(I use stringlists a lot)

Answer - unicodestring lists can save to ascii or unicode files, so 
TAnsiStringlist not needed.
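A minimal sketch of the overloaded routines referred to in Q8 to Q10, assuming Delphi 2009 or later where LoadFromFile and SaveToFile accept a TEncoding argument. The file names are made up for illustration.

```pascal
program StringListEncodingDemo;

{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  SL: TStringList;
begin
  SL := TStringList.Create;
  try
    SL.Add('Hell' + Char($00F6) + ' W' + Char($00F6) + 'rld');

    // Write and read the list as UTF-8, so no characters are lost
    SL.SaveToFile('TestUtf8.txt', TEncoding.UTF8);
    SL.LoadFromFile('TestUtf8.txt', TEncoding.UTF8);

    // Write the same list in the system ANSI code page; characters
    // outside that code page cannot be represented in the file
    SL.SaveToFile('TestAnsi.txt', TEncoding.Default);

    Writeln(SL.Count, ' line(s) written');
  finally
    SL.Free;
  end;
end.
```

The same TStringList serves both cases, which is why a separate TAnsiStringList is not needed: the encoding is a property of the file operation, not of the list.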

Q11 – do inifiles become unicode too?

Answer - looks like no?  Not clear?  Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly?   or can it
only be used on Ansi text files?

Anyone know this?

Q13 - It looks like most programmers editors read and write ascii and
unicode encoding... the one I use seems to distinguish between UTF-8 and
unicode as well – what is the difference?

Anyone know this?

John


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread David Brennan

Just thought I would chime in that I'm really interested in the answers to 
these questions too (Unicode being something we are also a bit apprehensive of).

-Original Message-
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On 
Behalf Of John Bird
Sent: Tuesday, 23 November 2010 1:04 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Thanks for the references, so I can answer most of the questions now. 
Here is what I understand so far, if anyone has anything to add this will be 
useful!

Extra question:

It looks like code like

for i:=1 to length(string1) do
begin
DoSomethingWithOneChar(string1[i]);
end;

cannot be used reliably.   The problems are that length(string1) looks like 
it cannot be safely used - as unicode characters may include 2 codepoints 
and length(string1) highlights that there is a difference between the number 
of unicode characters in a string and the number of codepoints.   Still 
figuring out what is the best practice here, as I have quite a lot of string 
routines.   Should be OK as long as the unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot
be read by earlier Delphi, eg D2007 any more?

Answer - From what I have read it is a project option; yes, the files are 
not portable if saved as unicode.

Q3 – I do a lot of reading ascii data files, and writing back.   Using
mainly TFilestream and stringlists.   Does this in general mean I will need
to use file variables declared as Ansichar and AnsiString instead of Char
and String?
(I would prefer to use the standard VCL where possible)

If I have variables
as1:Ansistring;
s2:string;

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

(otherwise how do I do this?)

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.

Q6 – I understand any code like

char1:=string1[i];
if char1 in ['a'..'z'] then
begin
message:=string1[i]+' - character is lowercase';
end

will break, as ansi characters are ordinal (less than 256 or 512)
and set comparisons ['a'..'z']  or ['a','b','c'] can be used, but this set
code cannot be used for unicode characters.   What is the replacement?

Answer - There is CharInSet call and numerous extra housekeeping functions 
added in TCharacter.

Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9
means tab?
if I have code like (logline string1 string2 are string)

logline:=FormatDateTime('dd-mmm- hh:nn:ss',now) + string1 +
#13#10+#9 + string2;
ShowMessage(logline);
Button1.hint:=logline;
writeln(f,logline);

these work D5-D2007   - ie a 2 line messagebox text, 2 line hint,
and 2 lines written to a log file.
is this still going to work?

do carriage returns/tabs/other control characters have to be defined
differently, eg as constants?

Answer - not figured out yet - anyone else know?
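
On Q7, a minimal sketch, assuming a Unicode-enabled Delphi (2009 or later) console project. The program name and the shortened format string are made up for illustration. The control character literals keep their ordinal values; only the size of Char changes. How Writeln converts the string for the console or for a Text file is not covered here and is worth checking separately.

```pascal
program ControlCharsDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  LogLine: string;
begin
  // #13, #10 and #9 are still CR, LF and TAB. In a Unicode Delphi they are
  // WideChar values with the same ordinal numbers as before.
  Assert(Ord(#13) = 13);
  Assert(Ord(#10) = 10);
  Assert(Ord(#9) = 9);

  // Char is now 2 bytes wide, but the source code is written the same way.
  Assert(SizeOf(Char) = 2);

  // Same style of concatenation as in the question.
  LogLine := FormatDateTime('hh:nn:ss', Now) + ' first line' +
    #13#10 + #9 + 'second line';
  Writeln(LogLine);

  // sLineBreak is the RTL constant for the platform line ending
  // (#13#10 on Windows), if a named constant is preferred.
  Writeln('first' + sLineBreak + 'second');
end.
```

So the literals do not have to be redefined as constants, although sLineBreak is available if that reads better.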

Q8 – stringlist1.loadfromfile('Test1.txt');
what happens if this file is ascii text being read into a stringlist
which is unicode strings.

Answer - Default is Ansi text for loadfromfile and savetofile (a file with a 
BOM is detected on load); use the overloaded routines with an encoding for Unicode

Q9 -   stringlist1.savetofile('Test1.txt')
 presumably this is no longer ascii text.   How do I save and read a
stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist
type (for ansistrings) as well as a unicode TStringlist type?
(I use stringlists a lot)

Answer - unicodestring lists can save to ascii or unicode files, so 
TAnsiStringlist not needed.

Q11 – do inifiles become unicode too?

Answer - looks like no?  Not clear?  Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly?   or can it
only be used on Ansi text files?

Anyone know this?

Q13 - It looks like most programmers' editors read and write ascii and
unicode encoding.   The one I use seems to distinguish between UTF-8 and
unicode as well – what is the difference?

Anyone know this?

John

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Jolyon Smith

 that definition will suffice).

UTF8/16/32 are different *encodings* for that character set.  For UTF16 and 
UTF32 there are also big-endian and little-endian variants.

As noted before, in Notepad, and possibly in other apps, the term “Unicode” 
denotes “UTF16”.

UTF32 is rarely encountered in the wild, which might explain why there is no 
TEncoding support for it (and indeed why Notepad doesn’t appear to support it).


As far as the difference between ASCII and UTF8 encoded Unicode goes:

An ASCII file can represent only characters 0..127 and each character is 
certain to occupy a single byte.

A UTF8 file can represent *EVERY* Unicode character, not just ASCII, but 
characters with codepoints > 127 will occupy 2 or more bytes.

You may have spotted that for an ASCII file, ASCII and UTF8 encoding are 
physically indistinguishable at the character data level.  However, a UTF8 
file written by Windows tools (as opposed to an ASCII file that could be treated 
naively as UTF8 – or vice versa) will usually have a BOM (Byte Order Mark), 
although strictly speaking the BOM is optional for UTF8.

A BOM is a sequence of bytes that is prepended to a file (or stream) to 
indicate the Unicode encoding and identify the byte order for those encodings 
that have big/little endian variants.
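
To make the byte counts and the BOM concrete, here is a small sketch, assuming Delphi 2009 or later where TEncoding lives in SysUtils. The program name and the ShowBytes/ShowUtf8 helpers are made up for illustration. Characters are written as #$ codes so the result does not depend on how the source file itself is saved.

```pascal
program Utf8Demo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: prints a caption followed by the bytes in hex.
procedure ShowBytes(const Caption: string; const Bytes: TBytes);
var
  I: Integer;
  Line: string;
begin
  Line := Caption + ':';
  for I := 0 to High(Bytes) do
    Line := Line + ' ' + IntToHex(Bytes[I], 2);
  Writeln(Line);
end;

// Hypothetical helper: encodes a string as UTF8 and prints the bytes.
procedure ShowUtf8(const Caption, S: string);
begin
  ShowBytes(Caption, TEncoding.UTF8.GetBytes(S));
end;

begin
  ShowUtf8('A (U+0041)', 'A');               // 41        (1 byte, same as ASCII)
  ShowUtf8('e acute (U+00E9)', #$00E9);      // C3 A9     (2 bytes)
  ShowUtf8('euro sign (U+20AC)', #$20AC);    // E2 82 AC  (3 bytes)

  // The BOM each encoding writes at the start of a file or stream.
  ShowBytes('UTF8 BOM', TEncoding.UTF8.GetPreamble);                 // EF BB BF
  ShowBytes('UTF16 LE BOM', TEncoding.Unicode.GetPreamble);          // FF FE
  ShowBytes('UTF16 BE BOM', TEncoding.BigEndianUnicode.GetPreamble); // FE FF
end.
```

The first line of output shows why a pure ASCII file and its UTF8 encoding are byte-for-byte the same apart from any BOM.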



I hope that all helps a little.


:-)






Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Ross Levis

You beat me to it.  I was going to say the same, that I'm interested in these 
answers also.  I have customers all over the world and just recently the 
display of Chinese characters was desired in a non-Chinese speaking country.  
So eventually I'll have to convert to Unicode.

Ross.

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Alister Christie

You should get a copy of Marco Cantu's Delphi 2009 Handbook - it has 
about 90 pages on Unicode in Delphi.  I think Bob Swart has a similar 
(less detailed) book.  There are also some videos from one of the 
CodeRage events (probably CodeRage 3 or 4).

Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085
Johnsonville
Wellington


On 18/11/2010 5:48 p.m., John Bird wrote:
 Planning upgrading from D2007 to XE, but want to read up on issues I will
 need to consider first to do with strings becoming Unicode by default.   I
 recall the release of D2009 came with good white papers explaining
 ramifications, however I haven’t seen these as I haven’t upgraded.   Asked
 for such also at the XE event but have not been sent anything yet.

 I have a lot of code which I want to plan to be able to recompile easily,
 and would like to plan this migration.   I would prefer to put anything
 contentious or varying into a library unit, a ‘wrapper’ so that I don’t have
 to deal with these version differences in the main code...

 Anyone can answer any of these quick questions please post here or email
 me – thanks!

 Q1 - Anyone got some good references to read up on ansistring to unicode
 issues ?  Comprehensive please!

 Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot
 be read by earlier Delphi, eg D2007 any more?

 Q3 – I do a lot of reading ascii data files, and writing back.   Using
 mainly TFilestream and stringlists.   Does this in general mean I will need
 to use file variables declared as Ansichar and AnsiString instead of Char
 and String?
 (I would prefer to use the standard VCL where possible)

 If I have variables
  as1:Ansistring;
  s2:string;

 Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

 Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

  (otherwise how do I do this?)

 Q6 – I understand any code like

  char1:=string1[i];
  if char1 in ['a'..'z'] then
  begin
  message:=string1[i]+' - character is lowercase';
  end

  will break, as ansi characters are ordinal (less than 256 or 512)
 and set comparisons ['a'..'z']  or ['a','b','c']can be used, this set
 code cannot be used for unicode characters.   What is the replacement?


 Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9
 means tab?
  if I have code like (logline string1 string2 are string)

  logline:=FormatDateTime(‘dd-mmm- hh:nn:ss’,now) + string1 +
 #13#10+#9 + string2;
  ShowMessage(logline);
  Button1.hint:=logline;
  writeln(f,logline);

  these work D5-D2007   - ie a 2 line messagebox text, 2 line hint,
 and 2 lines written to a log file.
  is this still going to work?

  do carriage returns/tabs/other control characters have to be defined
 differently, eg as constants?

 Q8 – stringlist1.loadfromfile(‘Test1.txt’);
  what happens if this file is ascii text being read into a stringlist
 which is unicode strings.

 Q9 -   stringlist1.savetofile(‘Test1.txt’)
   presumably this is no longer ascii text.   How do I save and read a
 stringlist to/from a file if it is to be Ansi text?

 Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist
 type (for ansistrings) as well as a unicode TStringlist type?
  (I use stringlists a lot)

 Q11 – do inifiles become unicode too?

 Q12 – does Windows Notepad open unicode text files correctly?   or can it
 only be used on Ansi text files?

 Q13 - It looks like most programmers' editors read and write ascii and
 unicode encoding.   The one I use seems to distinguish between UTF-8 and
 unicode as well – what is the difference?

 John

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Colin Johnsun

I won't answer everything but just on this one question:

On 23 November 2010 11:04, John Bird johnkb...@paradise.net.nz wrote:

 Extra question:

 It looks like code like

for i:=1 to length(string1) do
begin
DoSomethingWithOneChar(string1[i]);
end;

 cannot be used reliably.   The problems are that length(string1) looks like
 it cannot be safely used - as unicode characters may include 2 codepoints
 and length(string1) highlights that there is a difference between the
 number
 of unicode characters in a string and the number of codepoints.   Still
 figuring out what is the best practice here, as I have quite a lot of
 string
 routines.   Should be OK as long as the unicode text actually is ASCII.


you can use something like this:

var
  C: Char;
...
  for C in String1 do
  begin
    DoSomethingWithOneChar(C);
  end;

In this case you don't need to know the index of each character, you just
get the char using the for..in..do loop.

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Stefan Mueller

Jolyon beat me to answer those questions .. but here are my additional 2 cents:

 

Q1: Unicode strings treat each character as 2 bytes - length returns the 
number of characters (WideChar elements), not the size of memory allocated. Each 
access to it with an array syntax returns you a widechar instead of an ansichar. 
Your DoSomethingWithOneChar procedure will be called with a widechar as input but 
that probably won't cause any problems as widechar is a superset of ansichar so 
there won't be any issues when going in that direction.

 

Q8: stringlist.loadfromfile will auto-detect the encoding by looking for magic 
markers (BOM code 0xEF 0xBB 0xBF for UTF8 at the beginning of the file) and 
other things, like Unicode-codepoint encoding validity.

 

Q11: inifiles: yes, these files will now have support for Unicode too. 

 

Q13: Unicode is synonymous with “character encoding of the universal character 
set” – so it actually consists of two parts, the character set (about 109,000 
characters are officially defined) and the various encoding formats that are 
used to represent those characters (utf8/utf16/utf32/ucs2/ucs4/etc). Windows 
started with UCS-2 (in Windows NT) and then switched to UTF16. UCS-2 only 
allowed 65536 characters so Microsoft had to switch to UTF-16 in newer Windows 
versions to support the full character set. This means that some weird and/or no 
longer used characters from dead/historic languages can sometimes take up more 
than 2 bytes (the size of a widechar) – this isn’t usually an issue when 
developing Unicode enabled applications … unless your software needs to handle 
and display things like “cuneiform script” perfectly.

 


Kind Regards,
Stefan Mueller 
___
RD Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com

 

 

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On 
Behalf Of Jolyon Smith
Sent: Tuesday, November 23, 2010 9:40 AM
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

 

I'm guessing my response to your previous email didn't come thru for some 
reason - resending:

 

I shall address some of your questions that I can answer quickly:

 

Q2 – With XE do the .pas and .dfm files become unicode text and hence 

  cannot be read by earlier Delphi, eg D2007 any more?

I forget precisely which version of the IDE introduced the change, but the IDE 
has for some time supported different encodings for source/DFM files.  
Certainly this was present in D2006 and it may even have been as far back as D7 
or even earlier that it was introduced.

(Right click in source/dfm file and choose File Format from the context menu 
to see/change the file encoding)

Q3 – I do a lot of reading ascii data files, and writing back.   Using 

mainly TFilestream and stringlists.

With TFileStream you should be OK, as long as you read/write into 
ANSIString/ANSIChar buffers as you already surmised.

With TStringList you are forced to push your data through a Unicode/ANSI 
conversion when reading/writing from/to ANSI files, since the TStringList 
itself holds UnicodeString items.  You can do this using the new Encoding 
parameter to the relevant methods of the class to ensure you read/write the 
correct/expected encoding (reading should correctly detect the encoding, but 
when writing you will need to be explicit).
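
A minimal sketch of the Encoding parameter in use, assuming Delphi 2009 or later. The file names are made up for illustration and the files are created in the current folder.

```pascal
program StringListEncodingDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils, Classes;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('plain ASCII line');
    Lines.Add('fianc'#$00E9'e');   // #$00E9 = e acute

    // Be explicit when writing: the same list saved three different ways.
    Lines.SaveToFile('Test1_ansi.txt', TEncoding.Default);   // system ANSI code page
    Lines.SaveToFile('Test1_utf8.txt', TEncoding.UTF8);      // UTF8 with BOM
    Lines.SaveToFile('Test1_utf16.txt', TEncoding.Unicode);  // UTF16 LE with BOM

    // Reading: with no encoding given, a BOM (if present) is used to
    // pick the encoding.
    Lines.LoadFromFile('Test1_utf8.txt');
    Writeln('Lines read from UTF8 file: ', Lines.Count);

    // Reading a file that is known to be ANSI: say so explicitly.
    Lines.LoadFromFile('Test1_ansi.txt', TEncoding.Default);
    Writeln('Lines read from ANSI file: ', Lines.Count);
  finally
    Lines.Free;
  end;
end.
```

Whatever the file encoding, the items inside the list are always UnicodeString; the encoding only matters at the moment of loading or saving.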

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

Yes, but you will get a warning when going from Unicode to ANSI (since not all 
ANSI encodings will support the possible content of a Unicode string).  To 
avoid this, be explicit with the conversion.
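
Being explicit can look like the following sketch, which reuses the as1/s2 names from the question and assumes Delphi 2009 or later. The u8 variable is added for illustration.

```pascal
program ConversionDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;
var
  as1: AnsiString;
  s2: string;        // UnicodeString in Delphi 2009 and later
  u8: UTF8String;
begin
  as1 := 'abc';

  // Explicit Ansi -> Unicode cast: states the intent, nothing is lost.
  s2 := string(as1);

  // Explicit Unicode -> Ansi cast: characters that do not exist in the
  // system ANSI code page cannot survive this conversion.
  as1 := AnsiString(s2);

  // If the bytes must be 8-bit but no characters may be lost, UTF8 is an
  // alternative to ANSI.
  u8 := UTF8Encode(s2);
  s2 := UTF8ToString(u8);

  Writeln(s2);
end.
```

The casts do the same work as the implicit assignments; they just make it visible in the code where a conversion (and possible data loss) happens.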

Q6 – I understand any code like

char1:=string1[i];
if char1 in ['a'..'z'] then
begin
message:=string1[i]+' - character is lowercase';
end

will break.

Nope, it's fine.  But again, you will get a warning, in this case that the 
WIDECHAR has been reduced to a BYTE (NOTE: not converted to ANSICHAR) and a 
suggestion that you use CharInSet() instead.

Note however that CharInSet contains no real magic that makes sets work for > 255 
elements - it merely provides a wrapper around code that will avoid the 
suggestion that you use CharInSet().  You can achieve the same effect by again 
simply being explicit that you know that what you are doing is intended and 
safe by reducing the WideChar to an ANSIChar yourself:

  if ANSICHAR(char1) in ['a'..'z'] then

To my mind this is preferable to using CharInSet() as it makes it clearer in 
the code what is going on (that non-ANSIChars are not expected and may not be 
handled as intended).  Using CharInSet() won't make any material difference to 
the behaviour of the code, but it would make it less apparent what is going on 
(i.e. that your code deals specifically with ANSI chars).
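
For comparison, a small sketch of the alternatives, assuming Delphi 2009 to XE where TCharacter lives in the Character unit. The remark about the hard cast is my understanding of ordinal typecasts rather than something stated in this thread, so treat it as an assumption to verify.

```pascal
program CharInSetDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils, Character;
var
  C: Char;
begin
  C := 'q';

  // CharInSet: a test against a set of ANSI characters.
  if CharInSet(C, ['a'..'z']) then
    Writeln('q is in a..z');

  // TCharacter.IsLower: a Unicode-aware test, not limited to a..z.
  if TCharacter.IsLower(C) then
    Writeln('q is a lowercase letter');

  C := #$00E9;   // e acute

  // Not in the a..z range, but still a lowercase letter.
  Writeln('e acute in a..z: ', CharInSet(C, ['a'..'z']));
  Writeln('e acute is lowercase: ', TCharacter.IsLower(C));

  // Assumption to verify: a hard cast such as AnsiChar(C) keeps only the
  // low byte of the WideChar, so a character above #255 could be mistaken
  // for an unrelated ANSI character. CharInSet does not have that problem
  // as far as I know, because it rejects such characters first.
end.
```

If the code only ever needs to recognise a..z, any of these will do; the difference only shows up once non-ASCII text arrives.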

CharInSet() performs a test

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Jolyon Smith

Colin, the “for C in” loop and the “for i := 1 to Length()” loops are
functionally identical!  The only difference is that the “for in” version
incurs the slight overhead of the enumerator framework invoked by the
compiler and runtime magic to support that syntax.

 

But in neither case will the loop itself help detect/respond to surrogate
pairs (a single “WideChar” is potentially only ½ the data required to form a
complete “character”).  The only way to reduce an iterator over a string to
a simple char-wise loop, whether explicit or using enumerators, is to first
convert to UTF32, the facilities for which in the Delphi RTL are *cough*
rudimentary, to put it politely.  Non-existent may be nearer the mark.

 

The precise mechanics of the loop construct used is not material to that
problem.
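
To show what responding to surrogate pairs involves, here is a hand-rolled sketch that uses no RTL helpers beyond Length and Ord. The function names are made up for illustration. It deals with surrogate pairs only; a letter built from a base character plus a combining accent would still count as two code points.

```pascal
program CodePointLoopDemo;
{$APPTYPE CONSOLE}
uses
  SysUtils;

// Hypothetical helper: first half of a surrogate pair (U+D800..U+DBFF).
function IsHighSurrogateUnit(C: Char): Boolean;
begin
  Result := (Ord(C) >= $D800) and (Ord(C) <= $DBFF);
end;

// Hypothetical helper: second half of a surrogate pair (U+DC00..U+DFFF).
function IsLowSurrogateUnit(C: Char): Boolean;
begin
  Result := (Ord(C) >= $DC00) and (Ord(C) <= $DFFF);
end;

// Counts code points, treating a valid surrogate pair as one.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if IsHighSurrogateUnit(S[I]) and (I < Length(S)) and
       IsLowSurrogateUnit(S[I + 1]) then
      Inc(I, 2)   // step over both halves of the pair
    else
      Inc(I);     // ordinary BMP character (or a stray surrogate)
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // 'a', then U+1D11E (musical G clef) as its surrogate pair, then 'b'.
  S := 'a' + #$D834#$DD1E + 'b';
  Writeln('Length (UTF16 code units): ', Length(S));   // 4
  Writeln('Code points: ', CountCodePoints(S));        // 3
end.
```

For text that stays inside the BMP the two numbers are the same, which is why the plain loop works for most people.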

 

 

However, just as before Unicode when most people didn’t care and just wrote
code that assumed ANSI==ASCII, these days people won’t care and will write
code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring
surrogate pairs just as they used to ignore extended ASCII and ANSI
characters.

 

And for most people, that will probably actually work.

 

J

 

 

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Colin Johnsun

Doh! Thanks Jolyon for clearing that misunderstanding on my part. I was
aware of the surrogate pair issue but I wrongly assumed that this might have
been taken care by the iterator implementation. I guess not.

Thanks again!
Cheers,
Colin

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread John Bird

My main remaining question is the best way to handle code that up to now 
looked like:

for i:=1 to length(string1) do
begin
DoSomethingWithOneChar(string1[i]);
end;

If I got the gist correctly, string1[i] is one unicode character, but 
length(string1) is the number of codepoints in the string and not the number 
of characters.  This is gonna be confusing!

Other comments:

Comment 1 - I saw quite a few commentators say that they in general approved 
of the way that the unicode had been implemented - everything that was ansi 
string before is now unicode consistently throughout the whole language and 
IDE, and in the main the only code that needs altering is where Delphi is 
communicating outside the standard language:   ie

-DLL calls
-SavetoFile and LoadFromFile and other file access - even here smart 
defaults have been put in to retain expected behaviour.
-Sending strings to COM/TCP etc you might need to convert to get the kind 
expected
-Database fields - usually handled by making sure the right encoding is 
sent.

Comment 2 - The worst inconveniences are for those who have already tried to 
do some unicode type processing using WideChar, and the functions that were 
used for these.   Undoing these changes is usually the best way to cater 
for unicode.   Also some of the routines introduced then have horribly 
confusing names,  like AnsiPos   which is for searching widechars and is 
still what should be used for searching.   It seems to me that some 
identical routines should be introduced - eg called UnicodePos(.) 
just so that those who are new to Unicode can use at least a consistently 
named set of tools.   I would probably make routines named like this which 
I use just to be clear.

Comment 3 - I see a few people arguing that there should have been a 
compiler switch to allow compiling to ansistring  or unicode string 
depending on the compiler switch, to ease converting people to D2009/XE. 
There are merits either way on this - in the long term if everyone is going 
to have to live in a unicode world then it's probably better to bite the 
bullet and be made to convert code as eventually you cannot escape it.   In 
such a case a simpler compiler and VCL is a big advantage.   This is sort 
of related to being able to cross compile to 64 bit, iPhone, Android - 
whatever way makes it easy to have these forward looking options.   The 
quite stark reality is that in 5 years it looks like much but not all 
commercial software will be running on Windows;  it's likely to be a mix of 
Web/iPhone/Android/GoogleOS/MacOS,   so the forwards portability of compiling 
Delphi for different environments is way more important than whether it 
should be able to do Strings as  AnsiString.

Comment 4 - Has anyone at Embarcadero considered 2 ways to make cross
platform?   Option A is to go for a native compiler for different OS's -
best if it can be done.   Option B is the Java route - compile to intermediate
code for a Delphi Virtual Machine which can run interpreted with a runtime
on many OS's.   Could be called the Delphi Virtual VCL Machine.   The reason
why this might be a good way to go is that Delphi was originally designed as
a teaching language - ie a formally very strongly typed and well
structured language - so it could be about the best candidate around for
generalised compiling and a simple cross platform runtime.   Also with
Java now owned by Oracle there are questions over whether it has such a bright
future, and there is room for another similar approach.   DotNet is a similar
idea too, but will only ever really be Windows.   A Delphi Virtual Machine
being slower might not matter too much if it's portable.

[But I digress - The last point is way off topic for Unicode however]

Comment and question 5 - What is the status of Free Pascal/Lazarus wrt
unicode?   Does Delphi XE code port or not to Free Pascal?   It's an issue
to consider as well.


___
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: 
unsubscribe

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Jolyon Smith

@ Colin : No worries.

 

 

@All :  One other thing to point out is that when working with genuine,
actual Unicode strings you should be careful to use the correct ANSI()
functions...  yes, you read that right.

 

S := Uppercase(S);

 

Will NOT convert Unicode characters (just as it would previously have not
converted non-ASCII characters).

 

   S := ANSIUppercase( S );

 

On the other hand will.  The same goes for the likes of SameText() vs
ANSISameText() etc.
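
A minimal sketch of the difference, assuming a Unicode Delphi (D2009 or later) on Windows. The program name is made up for illustration, and the accented letters are written as four-digit character codes so that the result does not depend on how the source file is saved:

```pascal
program UpperCaseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  // "cafe" with an e-acute (U+00E9) as the last character.
  S := 'caf' + #$00E9;

  // UpperCase only maps 'a'..'z'; U+00E9 is left untouched.
  Assert(UpperCase(S) = 'CAF' + #$00E9);

  // AnsiUpperCase defers to the operating system, so U+00E9 becomes U+00C9.
  Assert(AnsiUpperCase(S) = 'CAF' + #$00C9);

  // The same split applies to case-insensitive comparison.
  Assert(not SameText(S, 'CAF' + #$00C9));
  Assert(AnsiSameText(S, 'CAF' + #$00C9));

  WriteLn('ok');
end.
```

So despite the name, the Ansi-prefixed routines are the ones that handle non-ASCII text in a UnicodeString.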

 

If you were writing for extended character sets in the past you were most
likely already using these routines, but if you weren't (perhaps because
Delphi doesn't support extended chars very well) and are now thinking that
by simply upgrading to a Unicode Delphi all such things are magically taken
care of, you will be in for a shock.

Better yet, use the routines introduced in the Character unit (why not
UnicodeUtils?  DOH!)

 

The only problem you then have is if you want to write string
handling/manipulating code that will be portable between Unicode and
non-Unicode Delphi compilers.

 

 

 

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On
Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 15:22
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

 

Doh! Thanks Jolyon for clearing up that misunderstanding on my part. I was
aware of the surrogate pair issue but I wrongly assumed that this might have
been taken care of by the iterator implementation. I guess not.

 

Thanks again!

Cheers,

Colin

 

On 23 November 2010 13:06, Jolyon Smith jsm...@deltics.co.nz wrote:

Colin, the "for C in" loop and the "for i := 1 to Length()" loops are
functionally identical!  The only difference is that the "for in" version
incurs the slight overhead of the enumerator framework invoked by the
compiler and runtime magic to support that syntax.

But in neither case will the loop itself help detect/respond to surrogate
pairs (a single "WideChar" is potentially only ½ the data required to form a
complete "character").  The only way to reduce an iterator over a string to
a simple char-wise loop, whether explicit or using enumerators, is to first
convert to UTF32, the facilities for which in the Delphi RTL are cough
rudimentary, to put it politely.  Non-existent may be nearer the mark.

The precise mechanics of the loop construct used is not material to that
problem.

However, just as before Unicode when most people didn't care and just wrote
code that assumed ANSI==ASCII, these days people won't care and will write
code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring
surrogate pairs just as they used to ignore extended ASCII and ANSI
characters.

And for most people, that will probably actually work.

 

J

 

 

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On
Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 14:31
To: NZ Borland Developers Group - Delphi List


Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

 

I won't answer everything but just on this one question:

On 23 November 2010 11:04, John Bird johnkb...@paradise.net.nz wrote:

Extra question:

It looks like code like

   for i:=1 to length(string1) do
   begin
   DoSomethingWithOneChar(string1[i]);
   end;

cannot be used reliably.   The problems are that length(string1) looks like
it cannot be safely used - as unicode characters may include 2 codepoints
and length(string1) highlights that there is a difference between the number
of unicode characters in a string and the number of codepoints.   Still
figuring out what is the best practice here, as I have quite a lot of string
routines.   Should be OK as long as the unicode text actually is ASCII.

you can use something like this:

var
  C: Char;
...
  for C in String1 do
  begin
    DoSomethingWithOneChar(C);
  end;

In this case you don't need to know the index of each character, you just
get the char using the for..in..do loop.

 

 

 



Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread John Bird

As I understand it iterating over a string with Chars does get around the
problem of surrogate pairs, as any character you are currently on might be
either 1, 2 or more bytes if it contains surrogate pairs, but just one unicode
character.   So if one is after iterating over the characters in the string
your code should be perfect.

My question is if you are not using   for C in String1 do   and want to use
for i:=1 to length(string1) do

what do you use instead of length to get the number of characters in the
string in general?  length is not the number of characters, it's the number of
code-points (including surrogate pairs counted separately) if I understand
correctly.

Separate issue - I understand that if one wants to iterate over the bytes of a 
string then one uses byte rather than char, and then one does have to 
investigate each byte to see if it is part of a surrogate pair.  There look to 
be routines for this – however I am guessing most won’t be needing to do this. 
Fortunately!


Also – I think  getting what we used to call the ASCII value of a character, or 
creating a character still works the same-  in fact for english alphabet the 
codes are the same I understand?  Can someone confirm.   (ie the character 
might use 2 bytes if encoded as unicode string, but the value stored for ‘A’ is 
still 41 hex or 65 decimal.   Which means I think that one can do


var
  code1, code2: integer;
  char1: ansichar;
  char2: char;

char1 := 'A';
char2 := 'A';   // unicode char, 2 bytes
code1 := ord(char1);
code2 := ord(char2);

in this case I think code1=code2 ??  anyone confirm this.   Of course once one 
goes away from English/latin 8859 characters this is no longer going to be true.



John
 

Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Jolyon Smith

No on two counts:

String[1] is one WIDE character, which may or may not be a complete Unicode 
codePOINT (and so equally may or may not be a complete Unicode character, 
although the definition of what constitutes a character in Unicode is a whole 
separate topic).



Length( s ) will always yield the number of chars in s.

The only wrinkle that Unicode introduces here is that the number of chars no 
longer == the number of *bytes* (each char is a WIDEChar and therefore 2 bytes).

But you can still reliably index each WIDEChar in a WIDEString using the [nth] 
element index.


Strings in COM have always been WideString - conversion to/from UnicodeString 
is automatic and lossless (in terms of data).

TCP, yes you will have to do work to support Unicode in this area if you 
haven't already done so (but the internet has - if not entirely, then in large 
part - been Unicode for a long time now, so you really should have taken care 
of this already, regardless of the Unicode-ness or otherwise of your Delphi 
code itself).

But that applies to ANY external systems with which your code interacts that 
may already be Unicode (or indeed which will remain resolutely ANSI, even if 
your app becomes Unicode).



In addition to inconveniences for people who had already done some work to 
support Unicode, the implementation does little/nothing to encourage or promote 
*correct* Unicode support in new projects and introduces potential for 
confusion and mistakes in many areas imho.

The entire string handling area of the RTL should have been thrown out and a 
properly thought out framework introduced to replace it, and yes, we should 
have been forced to migrate to the new, consistent and comprehensive string RTL 
(or at least encouraged, by marking all existing RTL support as deprecated).

PLUS, for the backwards compatibility crowd, they *could* have supported a
String == Unicode compiler switch imho (not just an "I wish they had" - I can
see technically precisely HOW it would and could have been implemented, and it
fits perfectly with their own advice for how to deal with code that is
problematic to convert to Unicode).

Whilst at a technical level this may not have been a huge advantage, it 
certainly would have been a welcome comfort to people facing the job of 
converting large applications with libraries of - in some cases no longer 
supported - 3rd party library code, by enabling them to flag those units as 
ANSI and deal with the conversion warnings that would have subsequently been 
emitted by linking with the Unicode VCL.

The only real argument against a compiler switch comes from the view that
having two versions of the VCL - one Unicode and one ANSI - would have been
required and would have been unworkable.  This is not the case IMHO.  The VCL
could have gone unilaterally and fixedly String==UnicodeString whilst allowing
us to compile our own units with String==ANSI/UnicodeString.

As I say, the technique of enforcing ANSI-ness in unsafe Unicode units in
order to defer the job of migrating those units to Unicode is well documented
and is the official advice in such difficult cases.

A compiler switch as I envisage it would simply have made that process more
straightforward - the net effect would have been the same, which on its own
demonstrates that such a switch was in fact technically possible IF IMPLEMENTED
IN THAT WAY, despite the protestations to the contrary (which assume a
DIFFERENT implementation approach).

Too late now of course.

:)





Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Todd

Hi John
 Extra question:

 It looks like code like

  for i:=1 to length(string1) do
  begin
  DoSomethingWithOneChar(string1[i]);
  end;

 cannot be used reliably.

I think the solution here is not to concentrate so much on unicode,
but rather on what DoSomethingWithOneChar() is trying to achieve.
Does the function even make sense for non-ANSI characters?

Todd.

 The problems are that length(string1) looks like
 it cannot be safely used - as unicode characters may include 2 codepoints
 and length(string1) highlights that there is a difference between the number
 of unicode characters in a string and the number of codepoints.   Still
 figuring out what is the best practice here, as I have quite a lot of string
 routines.   Should be OK as long as the unicode text actually is ASCII.

 Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot
 be read by earlier Delphi, eg D2007 any more?

 Answer - Is a project option from what I have read?, yes not portable if
 unicode.

 Q3 – I do a lot of reading ascii data files, and writing back.   Using
 mainly TFilestream and stringlists.   Does this in general mean I will need
 to use file variables declared as Ansichar and AnsiString instead of Char
 and String?
 (I would prefer to use the standard VCL where possible)

 If I have variables
  as1:Ansistring;
  s2:string;

 Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

 Answer - yes, there are performance issues to watch out for if conversion
 happens a lot.

 Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

  (otherwise how do I do this?)

 Answer - yes, there are performance issues to watch out for if conversion
 happens a lot.
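
A small sketch of Q4 and Q5, assuming a Unicode Delphi. The variable names follow the question above and the program name is invented. Plain assignment also compiles, but the compiler warns about the implicit string cast; the explicit casts below make the conversion visible in the source:

```pascal
program AnsiConvertDemo;
{$APPTYPE CONSOLE}

var
  as1: AnsiString;
  s2: string;
begin
  as1 := 'plain ASCII text';

  // Widening (Q4): valid ANSI text maps onto Unicode without loss.
  s2 := string(as1);
  Assert(Length(s2) = Length(as1));

  // Narrowing (Q5): only lossless when every character exists in the
  // ANSI code page; anything else cannot survive the conversion.
  as1 := AnsiString(s2);
  Assert(as1 = 'plain ASCII text');

  WriteLn(s2);
end.
```

For pure ASCII data both directions round-trip; the risk is confined to the narrowing direction with characters the code page lacks.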

 Q6 – I understand any code like

  char1:=string1[i];
  if char1 in ['a'..'z'] then
  begin
  message:=string1[i]+' - character is lowercase';
  end

  will break, as ansi characters are ordinal (less than 256 or 512)
 and set comparisons ['a'..'z']  or ['a','b','c']can be used, this set
 code cannot be used for unicode characters.   What is the replacement?

 Answer - There is CharInSet call and numerous extra housekeeping functions
 added in TCharacter.
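
A minimal sketch of the CharInSet replacement for Q6, assuming a Unicode Delphi. IsLowerAscii is a hypothetical helper name, not an RTL routine:

```pascal
program CharInSetDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper wrapping the old "in ['a'..'z']" test.
function IsLowerAscii(C: Char): Boolean;
begin
  // CharInSet takes the WideChar plus a set of AnsiChar and returns
  // False for anything above #255 instead of truncating it.
  Result := CharInSet(C, ['a'..'z']);
end;

begin
  Assert(IsLowerAscii('q'));
  Assert(not IsLowerAscii('Q'));
  Assert(not IsLowerAscii(#$0444)); // Cyrillic small ef, outside the set
  WriteLn('ok');
end.
```

Note that this still only answers "is it one of the ASCII letters a to z". Testing whether an arbitrary Unicode character is lowercase is a different question, and that is where the TCharacter routines come in.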

 Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9
 means tab?
  if I have code like (logline string1 string2 are string)

  logline:=FormatDateTime('dd-mmm- hh:nn:ss',now) + string1 +
 #13#10+#9 + string2;
  ShowMessage(logline);
  Button1.hint:=logline;
  writeln(f,logline);

  these work D5-D2007   - ie a 2 line messagebox text, 2 line hint,
 and 2 lines written to a log file.
  is this still going to work?

  do carriage returns/tabs/other control characters have to be defined
 differently, eg as constants?

 Answer - not figured out yet - anyone else know?
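
For what it is worth on Q7, a quick sketch, assuming a Unicode Delphi, suggesting the control-character literals keep their meaning. The code values are unchanged; only the storage width of each Char grows from one byte to two. The program and variable names are illustrative only:

```pascal
program LineBreakDemo;
{$APPTYPE CONSOLE}

var
  LogLine: string;
begin
  // The literals still denote CR, LF and TAB.
  Assert(Ord(#13) = 13);
  Assert(Ord(#10) = 10);
  Assert(Ord(#9) = 9);

  LogLine := 'first line' + #13#10 + #9 + 'second line';

  // CR LF starts right after the 10 characters of 'first line'.
  Assert(Pos(#13#10, LogLine) = 11);
  Assert(Length(LogLine) = 10 + 2 + 1 + 11);

  // On Windows the predefined sLineBreak constant is the same CR LF pair.
  WriteLn(LogLine);
end.
```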

 Q8 – stringlist1.loadfromfile('Test1.txt');
  what happens if this file is ascii text being read into a stringlist
 which is unicode strings.

 Answer - Default is Ascii text for loadfromfile and savetofile, use
 overloaded routines for Unicode

 Q9 -   stringlist1.savetofile('Test1.txt')
   presumably this is no longer ascii text.   How do I save and read a
 stringlist to/from a file if it is to be Ansi text?

 Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist
 type (for ansistrings) as well as a unicode TStringlist type?
  (I use stringlists a lot)

 Answer - unicodestring lists can save to ascii or unicode files, so
 TAnsiStringlist not needed.
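
A sketch of the overloaded routines mentioned for Q8 to Q10, assuming a Unicode Delphi on Windows. The file name is the one from the question and is written to the current directory; the program name is invented:

```pascal
program StringListEncodingDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

const
  FileName = 'Test1.txt'; // example file name taken from the question

var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('caf' + #$00E9);

    // Explicit encoding: write the list as UTF-8.
    Lines.SaveToFile(FileName, TEncoding.UTF8);

    // Explicit encoding on the way back in, so nothing relies on detection.
    Lines.Clear;
    Lines.LoadFromFile(FileName, TEncoding.UTF8);
    Assert(Lines[0] = 'caf' + #$00E9);

    // TEncoding.Default is the system ANSI code page on Windows, the
    // format that pre-Unicode Delphi wrote.
    Lines.SaveToFile(FileName, TEncoding.Default);
  finally
    Lines.Free;
  end;
  DeleteFile(FileName);
  WriteLn('ok');
end.
```

Passing the encoding explicitly on both sides keeps the file format a deliberate choice rather than a default that may differ between Delphi versions.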

 Q11 – do inifiles become unicode too?

 Answer - looks like no?  Not clear?  Anyone else know?

 Q12 – does Windows Notepad open unicode text files correctly?   or can it
 only be used on Ansi text files?

 Anyone know this?

 Q13 - It looks like most programmers editors read and write ascii and
 unicode encoding.   The one I use seems to distinguish between UTF-8 and
 unicode as well – what is the difference?

 Anyone know this?
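
On Q13, a hedged illustration rather than a full answer: editors on Windows usually say "Unicode" when they mean UTF-16, and UTF-8 is a different byte encoding of the same characters. The sketch below, assuming a Unicode Delphi, encodes one string both ways and compares the sizes:

```pascal
program EncodingSizeDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  S := 'caf' + #$00E9; // four characters, all in the BMP

  // UTF-16: two bytes for every one of the four code units.
  Assert(Length(TEncoding.Unicode.GetBytes(S)) = 8);

  // UTF-8: one byte for each ASCII letter, two for U+00E9.
  Assert(Length(TEncoding.UTF8.GetBytes(S)) = 5);

  WriteLn('ok');
end.
```

Pure ASCII text is therefore byte-for-byte identical in ASCII and UTF-8, but not in UTF-16.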

 John


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Jolyon Smith

As I understand it iterating over a string with Chars does get around the
problem of surrogate pairs

It depends what you mean by "get around the problem".

for c in s do WorkWith( c );

 

Will iterate once for each c (WIDECHAR) in s.  Some of those c's may be in
surrogate pairs, but you will get only 1 of each half of each pair at a
time.  So if your WorkWith() routine simply ignores surrogate pairs then
yes, you got around the problem.   But if WorkWith() needs to work on
discrete codepoints beyond the BMP then you have some extra work to do
before you can call WorkWith(), and you must call it with a UTF32 parameter,
NOT a UTF16 WideChar (unless WorkWith() has some way of keeping track of
calls made to it, and doing the job of combining surrogates for itself -
which is unlikely I think).

 

But crucially, for c in s is absolutely no different from:

 

for i := 1 to Length(s) do WorkWith( s[i] );

 

They do exactly the same thing - namely iterate over each widechar in the
string.

 

 

as any character you are currently on might be either 1,2 or more bytes if
it contains surrogate pairs, but just one unicode character

 

This makes no sense.  *Every* character (WIDECHAR) that you are on will be
2 bytes.  No more.  No Less.   The number of the bytes shall be 2, and 2
shall be the number.  What those 2 bytes represent may be either a complete
Unicode codepoint (in the BMP) or one of either a hi/lo char in a surrogate
pair, which must be combined to derive the codepoint they represent.
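
The combining step can be sketched as follows. CombineSurrogates is a hypothetical helper written out by hand from the UTF-16 arithmetic, not an RTL routine, and the example character is chosen purely for illustration:

```pascal
program SurrogateDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: combines a hi/lo surrogate pair into a codepoint.
function CombineSurrogates(Hi, Lo: WideChar): Cardinal;
begin
  Assert((Hi >= #$D800) and (Hi <= #$DBFF), 'not a high surrogate');
  Assert((Lo >= #$DC00) and (Lo <= #$DFFF), 'not a low surrogate');
  // Each half carries 10 bits of the value above $10000.
  Result := $10000 + ((Cardinal(Ord(Hi)) - $D800) shl 10)
                   + (Cardinal(Ord(Lo)) - $DC00);
end;

var
  S: string;
begin
  // U+1D11E (musical G clef) lies outside the BMP: two WideChars.
  S := #$D834#$DD1E;
  Assert(Length(S) = 2);
  Assert(CombineSurrogates(S[1], S[2]) = $1D11E);
  WriteLn('ok');
end.
```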

 

 

  what do  you use instead of length to get the number of characters
in the string in general?

 

Length(s) returns the number of WIDEChars. The number of n for which s[n]
is valid.
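
If a count of codepoints rather than WideChars is really wanted, one possible sketch is below. CodePointCount is a hypothetical helper, not part of the RTL. It counts codepoints only, so an "e" followed by a separate combining accent still counts as two:

```pascal
program CodePointCountDemo;
{$APPTYPE CONSOLE}

// Hypothetical helper: counts codepoints by stepping over the low half
// of each well-formed surrogate pair.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (S[I] >= #$D800) and (S[I] <= #$DBFF) and (I < Length(S)) and
       (S[I + 1] >= #$DC00) and (S[I + 1] <= #$DFFF) then
      Inc(I, 2)  // one codepoint stored as two WideChars
    else
      Inc(I);    // BMP character, or an unpaired surrogate
    Inc(Result);
  end;
end;

begin
  Assert(CodePointCount('abc') = 3);
  // 'a' plus U+1D11E: three WideChars, two codepoints.
  Assert(Length('a' + #$D834#$DD1E) = 3);
  Assert(CodePointCount('a' + #$D834#$DD1E) = 2);
  WriteLn('ok');
end.
```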

 

 

  length is not the number of characters, its the number of code-points
(including surrogate pairs counted separately)  if I understand correctly.

Nope - you understand incorrectly.  :)

 

 

Separate issue - I understand that if one wants to iterate over the bytes of
a string then one uses byte rather than char, and then one does have to
investigate each byte to see if it is part of a surrogate pair.

 

No, this is what you have to do with WideChars in a string.  You use bytes
if you don't care about the characters at all and simply want to work with
the raw byte data.  Unlikely in the context of the questions you are asking
here, I would add.


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Todd

Hi John
 Extra question:

 It looks like code like

  for i:=1 to length(string1) do
  begin
  DoSomethingWithOneChar(string1[i]);
  end;

 cannot be used reliably.

I think the solution here is not to concentrate on unicode vs widechar 
vs ansichar, but rather on what DoSomethingWithOneChar() is actually 
trying to achieve.

Does the function even make sense for non-ANSI characters? Only a more 
concrete example can be discussed with meaning.

Todd.


Re: [DUG] Upgrading to XE - Unicode strings questions

2010-11-22 Thread Ross Levis

 Length( s ) will always yield the number of chars in s.

So how does one obtain the number of bytes in a string if one wants to use 
AnsiChar to check every character?

Does s[0] work?

Ross.
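
A possible sketch for the byte-count question, assuming a Unicode Delphi. As far as I know s[0] only ever applied to ShortString, so it is not an option for the default string type. The program name is invented:

```pascal
program ByteLengthDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  A: AnsiString;
  I: Integer;
begin
  S := 'abc';

  // Storage size of the UTF-16 data, two equivalent ways.
  Assert(Length(S) * SizeOf(Char) = 6);
  Assert(ByteLength(S) = 6);

  // To check single-byte AnsiChars, convert first; each element of A
  // is then exactly one byte.
  A := AnsiString(S);
  Assert(Length(A) = 3);
  for I := 1 to Length(A) do
    Write(Ord(A[I]), ' ');
  WriteLn;
end.
```

Converting to AnsiString is only safe when the text is known to fit the ANSI code page; otherwise the per-byte check runs on data that has already lost characters.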




[DUG] Upgrading to XE - Unicode strings questions

2010-11-17 Thread John Bird

Planning upgrading from D2007 to XE, but want to read up on issues I will 
need to consider first to do with strings becoming Unicode by default.   I 
recall the release of D2009 came with good white papers explaining 
ramifications, however I haven’t seen these as I haven’t upgraded.   Asked 
for such also at the XE event but have not been sent anything yet.

I have a lot of code which I want to plan to be able to recompile easily, 
and would like to plan this migration.   I would prefer to put anything 
contentious or varying into a library unit, a ‘wrapper’ so that I don’t have 
to deal with these version differences in the main code...

Anyone can answer any of these quick questions please post here or email 
me – thanks!

Q1 - Anyone got some good references to read up on ansistring to unicode 
issues ?  Comprehensive please!

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot 
be read by earlier Delphi, eg D2007 any more?

Q3 – I do a lot of reading ascii data files, and writing back.   Using 
mainly TFilestream and stringlists.   Does this in general mean I will need 
to use file variables declared as Ansichar and AnsiString instead of Char 
and String?
(I would prefer to use the standard VCL where possible)

If I have variables
as1:Ansistring;
s2:string;

Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

(otherwise how do I do this?)

Q6 – I understand any code like

char1:=string1[i];
if char1 in ['a'..'z'] then
begin
message:=string1[i]+' - character is lowercase';
end

will break, as ansi characters are ordinal (less than 256 or 512) 
and set comparisons ['a'..'z']  or ['a','b','c']can be used, this set 
code cannot be used for unicode characters.   What is the replacement?


Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9 
means tab?
if I have code like (logline string1 string2 are string)

logline:=FormatDateTime('dd-mmm- hh:nn:ss',now) + string1 +
#13#10+#9 + string2;
ShowMessage(logline);
Button1.hint:=logline;
writeln(f,logline);

these work D5-D2007   - ie a 2 line messagebox text, 2 line hint, 
and 2 lines written to a log file.
is this still going to work?

do carriage returns/tabs/other control characters have to be defined 
differently, eg as constants?

Q8 – stringlist1.loadfromfile('Test1.txt');
what happens if this file is ascii text being read into a stringlist 
which is unicode strings.

Q9 -   stringlist1.savetofile('Test1.txt')
 presumably this is no longer ascii text.   How do I save and read a 
stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist 
type (for ansistrings) as well as a unicode TStringlist type?
(I use stringlists a lot)

Q11 – do inifiles become unicode too?

Q12 – does Windows Notepad open unicode text files correctly?   or can it 
only be used on Ansi text files?

Q13 - It looks like most programmers editors read and write ascii and
unicode encoding.   The one I use seems to distinguish between UTF-8 and
unicode as well – what is the difference?

John
