Re: [DUG] Upgrading to XE - Unicode strings questions
Iterating over a string is for the purpose of doing something with each individual character, whether it is an 'A' or an 'A' with a ^ (caret) on top of it. When I said the number of bytes in a character varies, I did not mean the number of bytes in a Char; I meant that the total number of bytes in one resulting character or letter might vary. For instance the word fiancee (with an acute on the last e) has 7 characters, the last of which might be 2 code units.

When I iterate over a string I ideally want to get one character in the word each time. Could I build a string like this?

  setlength(String1,7);
  string1[1] := 'f';
  string1[2] := 'i';
  string1[3] := 'a';
  string1[4] := 'n';
  string1[5] := 'c';
  string1[6] := 'e';
  string1[7] := 'e'; // I would want the full e acute here

hence I want to be able to go

  for i := 1 to length(string1) do
  begin
    thisChar := string1[i]; // get each character one at a time
    listbox1.items.add('i=' + inttostr(i) + ' character at position i = ' + ThisChar);
  end;

I would be expecting to see 7 characters, 7 lines in the list box, and length=7, with the last being e acute. Now everything Jolyon is saying, and Cary also, implies that this is not going to work. This looks to be a real nuisance!

Now I think the e acute could be one unicode character (as there is likely to be a representation using one character, one code point and one code unit) or one character, two code units, 2*2 bytes - a surrogate pair - where eg one supplies the e and one the acute. So it looks like what I see might vary according to how the e acute is encoded in the string?

As I read further this gets murkier, as some of the things Cary Jensen says are not the same as what you say, even if you say it emphatically! This is why I am thinking we have to understand clearly Unicode, and the Windows implementation of it, and I don't really yet.
Here is what Cary Jensen says about a similar example with 7 characters, one of which is a surrogate pair:

Although there are 7 characters in the printed string, the UnicodeString contains 8 code units, as returned by the Length function. Inspection of the 6th and 7th elements of the UnicodeString reveal the high and low surrogate values, each of which are code units. And, though the size of the UnicodeString is 16 bytes, ElementToCharLen accurately returns that there were a total of 7 code points in the string.

While these answers suffice for surrogate pairs, unfortunately, things are not exactly the same when it comes to composite characters. Specifically, when a UnicodeString contains at least one composite character, that composite character may occupy two or more code units, though only one actual character will appear in the displayed string. Furthermore, ElementToCharLen is designed specifically to handle surrogate pairs, and not composite characters.

Actually, composite characters introduce an issue of string normalization, which is not currently handled by Delphi's RTL (runtime library). When I asked Seppy Bloom about this, he replied that Microsoft has recently added normalization APIs (application programming interfaces) to some of the latest versions of Windows®, including Windows® Vista, Windows® Server 2008, and Windows® 7.

Seppy was also kind enough to offer a code sample of how you might count the number of characters in a UnicodeString that includes at least one composite character. I am including this code here for your benefit, but I must offer these cautions. First, this code has not been thoroughly tested, and has not been certified. If you use it, you do so at your own risk. Second, be aware that this code will not work on pre-Windows XP installations, and will only work with Windows XP if you have installed the Microsoft Internationalized Domain Names (IDN) Mitigation APIs 1.1.
http://www.embarcadero.com/images/dm/technical-papers/delphi-unicode-migration.pdf

Elsewhere he implies that Delphi can handle normalised strings for comparisons if one is careful, as in

  var
    s1, s2: String;
  begin
    ListBox1.Items.Clear;
    s1 := 'Hell'#$006F + #$0308' W'#$006F + #$0308'rld'; //make using surrogate pairs
    s2 := 'Hellö Wörld';
    ListBox1.Items.Add(s1);
    ListBox1.Items.Add(s2);
    ListBox1.Items.Add(BoolToStr(s1 = s2, True));
    ListBox1.Items.Add(BoolToStr(AnsiCompareStr(s1, s2) = 0, True));
  end;

The contents of ListBox1 are shown in the following figure.

  Hellö Wörld
  Hellö Wörld
  False
  True

Now I am not sure if the above example will show properly in email, because email text is generally limited to the ASCII characters and lists like this usually also restrict to text and not HTML emails. So as a related exercise I am curious whether the above example prints OK on the list. The words hello and world should have an umlaut (..) over each o, in case it doesn't arrive like that on the list.

John

As I understand it iterating over a string with Chars
Re: [DUG] Upgrading to XE - Unicode strings questions
John,

I think you are confusing Canonical Normalized versions of the same Unicode string (in the example s1 is canonical, s2 is normalized) and the effect of local codepage conversion.

Windows-1252 codepage (latin ISO 8859-1) has support for characters like the ö (ascii code #246) and é (ascii code #233). Converting to ansistring/ansichar on your system will take care of canonical Unicode representation and hence return true if you compare those strings. Please note that this only works because your system is set to a latin based codepage ... do the same on a Japanese version of windows and you'll get a very different result, as there is no support for ö in ansistring under a Japanese codepage!

Because your system is Latin, your first testcase/example of you building the word fiancee should actually work without problems - Jolyon/Cary are probably wrong if they indeed implied that this wouldn't work.

The ö can be written as a compound #$006F + #$0308 in canonical format ... and as #$00f6 in the normalized format. For most normal applications it just doesn't really matter either way, because a user that is inputting text under his local codepage will always do it the same way and hence the chances of you encountering a mix between canonical/normalized versions will be close to zero.

You only ever get issues if you cross codepage boundaries (like for example if you have users in different countries storing data in a database - which is why international databases often use UTF-8 to store data instead of their native character sets). Most of the better databases (like for example Oracle) have built in support for sorting and handling canonical format and do the conversion automatically for you ... for someone writing desktop applications it usually just isn't an issue either way.
Kind Regards,
Stefan Mueller
___
RD Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com

-Original Message-
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of John Bird
Sent: Tuesday, November 23, 2010 7:33 PM
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
Re: [DUG] Upgrading to XE - Unicode strings questions
> I think you are confusing Canonical Normalized versions of the same Unicode string (in the example s1 is canonical, s2 is normalized) and the effect of local codepage conversion.

Yep, and for the record I think this is a big problem with the way Embarcadero implemented Unicode. By pursuing the "Unicode is a no-brainer" approach (facilitating easy migration for ASCII apps) they have obfuscated the fact that Unicode is far from simple. Or at least doing it right is. Danny Thorpe opined years ago that it made a lot of sense to do 64-bit and Unicode in one go as a big-bang breaking change, leaving the 32-bit, ANSI VCL product behind as a legacy platform. Danny Thorpe always was a clever guy! ;)

> The ö can be written as a compound #$006F + #$0308 in canonical format ... and as #$00f6 in the normalized format. For most normal applications it just doesn't really matter either way because a user that is inputting text under his local codepage will always do it the same way

A user could specifically choose to enter that character in either form - this is unlikely, yes. Or, two users using the same codepage could choose to enter the character differently. Or your data could be coming from two separate external sources. The *only* way to be sure is to normalise before processing.

> You only ever get issues if you cross codepage boundaries (like for example if you have users in different countries storing data in a database - which is why international databases often use UTF-8 to store data instead of their native character sets).

This makes no sense at all to me. ö is encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8. Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint vs a character followed by a diacritic are still two distinct character sequences.
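To make "normalise before processing" a little more concrete, here is a minimal sketch of composing a string to form C through the Windows normalization API mentioned in Cary Jensen's paper. It rests on several assumptions: Windows Vista or later (where normaliz.dll exports NormalizeString), a Unicode Delphi (2009 or later), and a hand-written import declaration. The constant NormalizationC and the helper name ToNFC are illustrative names of my own, not part of the Delphi RTL, and the code has not been thoroughly tested.

```pascal
program NormaliseDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

const
  NormalizationC = 1; // NORM_FORM value for normalization form C (composed)

// Hand-declared import of the Windows API (assumption: Vista or later).
function NormalizeString(NormForm: Integer; lpSrcString: PWideChar;
  cwSrcLength: Integer; lpDstString: PWideChar;
  cwDstLength: Integer): Integer; stdcall; external 'normaliz.dll';

// Hypothetical helper: returns S in composed form (NFC).
function ToNFC(const S: string): string;
var
  Estimated, Written: Integer;
begin
  Result := '';
  if S = '' then
    Exit;
  // With a zero-length destination the API returns an estimated buffer size.
  Estimated := NormalizeString(NormalizationC, PWideChar(S), Length(S), nil, 0);
  if Estimated <= 0 then
    RaiseLastOSError;
  SetLength(Result, Estimated);
  Written := NormalizeString(NormalizationC, PWideChar(S), Length(S),
    PWideChar(Result), Estimated);
  if Written <= 0 then
    RaiseLastOSError; // a robust version would retry with a larger buffer
  SetLength(Result, Written);
end;

var
  s1, s2: string;
begin
  s1 := 'Hell'#$006F#$0308' W'#$006F#$0308'rld'; // o + combining diaeresis
  s2 := 'Hell'#$00F6' W'#$00F6'rld';             // single codepoint U+00F6
  Writeln(s1 = s2);               // FALSE: 13 code units versus 11
  Writeln(ToNFC(s1) = ToNFC(s2)); // intended to compare equal once both are composed
end.
```

Normalising both sides at the point where data enters the application keeps every later comparison a plain `=`, which is the design the reply above argues for.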
___ NZ Borland Developers Group - Delphi mailing list Post: delphi@delphi.org.nz Admin: http://delphi.org.nz/mailman/listinfo/delphi Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe
Re: [DUG] Upgrading to XE - Unicode strings questions
John, the problem is that in Unicode "single character" is meaningless unless you have performed some pre-processing to GIVE that term some meaning. There are some standard forms for such processing, called Normalisations. The problem is that a single character to your eyes, e.g. an accented a, could be represented in a Unicode string in at least two ways:

1. A single codepoint representing that accented a
2. TWO codepoints - the first representing a and the second a diacritic codepoint for the accent

> Iterating over a string is for the purpose of doing something with each individual character

That's fine, but in Unicode what you have is a string not of characters but of codepoints. The concept of a character is not synonymous with codepoint in Unicode in the same way that it is with ASCII or even ANSI. So you have compounded complications:

a. Depending on encoding, a single codepoint (32-bit value) may be encoded in 1, 2, or more bytes. Each byte may represent a whole codepoint or only part of a codepoint encoding.
b. Each codepoint may represent a whole character or only PART of a character encoding.

Complication 'a' can be avoided by adopting UTF-32 encoding - 4 bytes for EVERY codepoint. That is hugely wasteful in terms of memory/storage for most applications. UTF-16 - the encoding used by Delphi and indeed by Windows natively itself - is a compromise. It is less efficient than ANSI for ASCII, but more efficient than UTF-32 for ANSI character sets represented in the BMP. For applications working entirely in the BMP, UTF-16 is also relatively easy to process - for NORMALISED strings, each codepoint IS a character (in the BMP). But for non-normalised data that is still not necessarily the case.

> could I build a string like this?
> setlength(String1,7);
> string1[1] := 'f';
> string1[2] := 'i';
> string1[3] := 'a';
> string1[4] := 'n';
> string1[5] := 'c';
> string1[6] := 'e';
> string1[7] := 'e'; //I would want the full e acute here

Yes, you can.
But you might also *receive* from another source a string that is apparently the same at the visual representation level, but different at the data level, where:

  string1[1] = 'f';
  string1[2] = 'i';
  string1[3] = 'a';
  string1[4] = 'n';
  string1[5] = 'c';
  string1[6] = 'e';
  string1[7] = 'e';    // Normal 'e' character, i.e. identical to string1[6]
  string1[8] = U+0301; // Combining acute diacritic

When displayed on screen this string will appear identical to your string, but it is represented in the data in a different way.

> hence I want to be able to go
> for i :=1 to length(string1) do begin .. end
> Now everything Jolyon is saying and Cary also implies that this is not going to work. This looks to be a real nuisance!

I don't know what gave you that impression from what I said. Yes, Unicode is/can be a real nuisance - *properly* supporting it is a lot more work than people think - but what you want to do here can be done.

> Now I think the e acute could be one unicode character (as there is likely to be a representation using one character, one code point and one code unit) or as one character, two code units, 2*2 bytes - a surrogate pair - where eg one supplies the e and one the acute.

NO!!! This is NOT what a surrogate pair is. A surrogate pair is encountered ONLY in UTF-16, and is found when you have a codepoint that is not in the BMP, i.e. a value > 65535 that cannot be encoded in a 16-bit value. These are typically CJVK characters (Chinese/Japanese/Vietnamese/Korean), sometimes called Han or Kanji character sets. The first 16-bit value indicates a page in the non-BMP. The following 16-bit value then identifies an entry in that page. To obtain the codepoint that the PAIR of VALUES represents, you have to apply a transform, combining the page selector with the page entry. But what you get is a single codepoint. (You don't have to do this yourself - there are routines to do it for you, but you have to invoke them as appropriate.)
A Surrogate Pair is a representation of a single codepoint, NOT a relationship between TWO codepoints. When you have a visual character encoded as a codepoint + a following, combining codepoint, that is simply TWO Unicode codepoints that are combined to form one VISUAL character. That is NOT a surrogate pair however. It is merely two codepoints that have to be combined.
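The three cases discussed above (one precomposed codepoint, a base letter plus a combining codepoint, and a surrogate pair) can be told apart with a few lines of code. The sketch below assumes a Unicode Delphi (2009 or later) with the default 1-based string indexing; CountCodepoints is a hypothetical helper written for this illustration, and the G clef character U+1D11E is just a convenient example of a codepoint outside the BMP.

```pascal
program CodeUnitsDemo;
{$APPTYPE CONSOLE}

// Counts codepoints in a UTF-16 string by not counting the second
// (low) half of each surrogate pair, which lies in $DC00..$DFFF.
function CountCodepoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) < $DC00) or (Ord(S[I]) > $DFFF) then
      Inc(Result);
end;

var
  Precomposed, Decomposed, Clef: string;
begin
  Precomposed := 'fiance'#$00E9;  // last letter is the single codepoint U+00E9
  Decomposed  := 'fiancee'#$0301; // plain 'e' followed by combining acute U+0301
  Clef        := #$D834#$DD1E;    // U+1D11E encoded as a surrogate pair

  // Each line prints: code units (Length), then codepoints.
  Writeln(Length(Precomposed), ' ', CountCodepoints(Precomposed)); // 7 7
  Writeln(Length(Decomposed), ' ', CountCodepoints(Decomposed));   // 8 8
  Writeln(Length(Clef), ' ', CountCodepoints(Clef));               // 2 1
end.
```

Note that the first two strings both display as seven visible characters, yet only the surrogate pair case shows a difference between code units and codepoints. The combining sequence is invisible to this kind of counting, which is why normalisation is a separate step.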
Re: [DUG] Upgrading to XE - Unicode strings questions
Hi John

You can find out whether a unicode string is inside the BMP by converting it to UTF-32 and checking that the new string is twice the length of the original (UTF-16) string.

> A user could specifically choose to enter that character in either form - this is unlikely, yes. Or, two users using the same codepage could choose to enter the character differently. Or if your data is coming from two separate external sources. The *only* way to be sure is to normalise before processing.

Agreed. That will eliminate any issues with composite codepoints.

>> You only ever get issues if you cross codepage boundaries (like for example if you have users in different countries storing data in a database - which is why international databases often use UTF-8 to store data instead of their native character sets).
>
> This makes no sense at all to me. ö encoded as #$006F + #$0308 **OR** #$00f6 even in UTF-8. Whether you encode using UTF-8, UTF-16 or UTF-32, a single accented character codepoint vs a character followed by a diacritic are still two distinct character sequences.

True. I think the point is that UTF-8 is the most compact format without data loss, regardless of whether the codepoints are composite or not.

Todd.
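A UTF-16 string lies entirely inside the BMP exactly when it contains no surrogate code units, in which case its UTF-32 form has the same number of elements at twice the byte size. So the same question can be answered without any conversion by scanning for values in the surrogate range. The following is a sketch under that reading of the check; IsAllBmp is a hypothetical helper name, and a Unicode Delphi with 1-based strings is assumed.

```pascal
program BmpCheckDemo;
{$APPTYPE CONSOLE}

// True when S contains no surrogate code units ($D800..$DFFF),
// i.e. every codepoint in S is in the BMP.
function IsAllBmp(const S: string): Boolean;
var
  I: Integer;
begin
  Result := True;
  for I := 1 to Length(S) do
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DFFF) then
    begin
      Result := False;
      Exit;
    end;
end;

begin
  Writeln(IsAllBmp('fiance'#$00E9));  // TRUE
  Writeln(IsAllBmp('x'#$D834#$DD1E)); // FALSE: contains a surrogate pair
end.
```

A string that passes this check can still contain combining sequences, so it answers only the surrogate question, not the "one Char per visible character" question.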
Re: [DUG] Upgrading to XE - Unicode strings questions
I read in one of the references that UTF-32 was a more common standard on Unix systems - which means I guess they have chosen the simplest format at the trade-off of using more space? I think Linux/Windows/MacOS use UTF-16 more commonly...

Anyway for the time being, as long as the data in strings is unicode, but is still Latin 8859 (ie ASCII characters), I can without worrying too much iterate over a string one character at a time...using length. That was the main thing I wanted to know.

John
Re: [DUG] Upgrading to XE - Unicode strings questions
> Anyway for the time being, as long as the data in strings is unicode, but is still Latin 8859 (ie ASCII characters) I can without worrying too much iterate over a string one character at a time...using length.

Yep. But you are building an app that now supports Unicode. If your users are able to enter data into your app, your app will now *potentially* find itself handling Unicode data for which it was not designed, unless you take additional steps to prevent a user from entering non-ASCII data in the first place.

Previously you may not have taken these steps, so theoretically you could have found a user entering non-ASCII, ANSI characters too, except that in the past you would not have been using Unicode support as an advertised (or even unadvertised) feature of your app and could legitimately have told such users not to be so dumb (in not so many words, of course :D)

This again is the danger of the "no brainer" approach with the Unicode migration in Delphi. By selling the idea that switching to Unicode was easy, they have just made it more confusing in many cases, imho.

"If I can just recompile and patch up a few warnings with some boilerplate, how come there's all this other stuff that I need to do too? I thought Unicode was supposed to make supporting this stuff easier."

Answer: It does. It makes supporting Unicode easier, but supporting Unicode is not, itself, easy. imho
Re: [DUG] Upgrading to XE - Unicode strings questions
It's a shame UTF-8 wasn't made the standard in Delphi. It's commonly used in audio file tags, for example, which I have to deal with. My software needs to search for songs with specific artists or titles, and it sounds like I'm going to have problems where the information is visually the same but entered differently in different parts of the world, using all sorts of 3rd party software.

Ross.

-Original Message-
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Todd
Sent: Wednesday, 24 November 2010 11:27 AM
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
Re: [DUG] Upgrading to XE - Unicode strings questions
You should be fine - you just have to ensure you normalise the strings. You're going to have to convert from UTF-8 to UTF-16 to bring them in to your Delphi app anyway, for processing, so you may as well normalise them in the process.

UTF-16 was chosen in Delphi because it is also the native encoding in Windows itself.

-Original Message-
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Ross Levis
Sent: Wednesday, 24 November 2010 16:00
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
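As an illustration of the tag scenario, the sketch below decodes two UTF-8 byte sequences that would both display as "Café" and shows that the resulting Delphi strings differ until they are normalised. The byte values are written out by hand for the example (they do not come from any real tag file), and TEncoding from SysUtils, available since Delphi 2009, is assumed.

```pascal
program Utf8TagDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  RawA, RawB: TBytes;
  TitleA, TitleB: string;
begin
  // 'Café' as UTF-8 bytes with a precomposed é (C3 A9)...
  RawA := TBytes.Create($43, $61, $66, $C3, $A9);
  // ...and with a plain 'e' followed by combining acute U+0301 (CC 81).
  RawB := TBytes.Create($43, $61, $66, $65, $CC, $81);

  // Decode UTF-8 into Delphi's native UTF-16 strings.
  TitleA := TEncoding.UTF8.GetString(RawA);
  TitleB := TEncoding.UTF8.GetString(RawB);

  Writeln(Length(TitleA));  // 4 code units
  Writeln(Length(TitleB));  // 5 code units
  Writeln(TitleA = TitleB); // FALSE, although both display as the same word
end.
```

The conversion itself changes only the encoding, not the choice between composed and decomposed forms, so the normalisation step suggested above still has to be applied after decoding and before searching.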
Re: [DUG] Upgrading to XE - Unicode strings questions
Thanks for the references, so I can answer most of the questions now. Here is what I understand so far; if anyone has anything to add this will be useful!

Extra question: It looks like code like

  for i := 1 to length(string1) do
  begin
    DoSomethingWithOneChar(string1[i]);
  end;

cannot be used reliably. The problem is that length(string1) looks like it cannot be safely used, as unicode characters may include 2 codepoints and length(string1) highlights that there is a difference between the number of unicode characters in a string and the number of codepoints. Still figuring out what is the best practice here, as I have quite a lot of string routines. Should be OK as long as the unicode text actually is ASCII.

Q2 - With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007, any more?
Answer - Is a project option from what I have read? Yes, not portable if unicode.

Q3 - I do a lot of reading ascii data files, and writing back, using mainly TFilestream and stringlists. Does this in general mean I will need to use file variables declared as Ansichar and AnsiString instead of Char and String? (I would prefer to use the standard VCL where possible)

If I have variables

  as1: Ansistring;
  s2: string;

Q4 - if I do s2:=as1 does this convert ansistrings to unicode?
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q5 - if I do as1:=s2 does this convert a unicode string to ansistring? (otherwise how do I do this?)
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q6 - I understand any code like

  char1 := string1[i];
  if char1 in ['a'..'z'] then
  begin
    message := string1[i] + ' - character is lowercase';
  end

will break, as ansi characters are ordinal (less than 256 or 512) and set comparisons ['a'..'z'] or ['a','b','c'] can be used, but this set code cannot be used for unicode characters. What is the replacement?
Answer - There is a CharInSet call and numerous extra housekeeping functions added in TCharacter.

Q7 - do literals like #13#10 still mean carriage return and linefeed? #9 means tab? If I have code like (logline, string1, string2 are string)

  logline := FormatDateTime('dd-mmm- hh:nn:ss', now) + string1 + #13#10 + #9 + string2;
  ShowMessage(logline);
  Button1.hint := logline;
  writeln(f, logline);

these work in D5-D2007 - ie a 2 line messagebox text, 2 line hint, and 2 lines written to a log file. Is this still going to work? Do carriage returns/tabs/other control characters have to be defined differently, eg as constants?
Answer - not figured out yet - anyone else know?

Q8 - stringlist1.loadfromfile('Test1.txt'); what happens if this file is ascii text being read into a stringlist which is unicode strings?
Answer - Default is Ascii text for loadfromfile and savetofile, use overloaded routines for Unicode.

Q9 - stringlist1.savetofile('Test1.txt') presumably this is no longer ascii text. How do I save and read a stringlist to/from a file if it is to be Ansi text?

Q10 - If there are complexities in Q8 and Q9, is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type? (I use stringlists a lot)
Answer - unicodestring lists can save to ascii or unicode files, so TAnsiStringlist is not needed.

Q11 - do inifiles become unicode too?
Answer - looks like no? Not clear? Anyone else know?

Q12 - does Windows Notepad open unicode text files correctly? Or can it only be used on Ansi text files? Anyone know this?

Q13 - It looks like most programmers editors read and write ascii and unicode encoding. The one I use seems to distinguish between UTF-8 and unicode as well - what is the difference? Anyone know this?

John
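Several of the answers above (Q4 to Q9) can be tried out in one small console program. This is a minimal sketch assuming a Unicode Delphi (2009 or later) with the classic unit names SysUtils and Classes; the file names are placeholders and are written to the current directory. As far as I know, character literals such as #13#10 and #9 keep their meaning in Unicode Delphi, and the encoding used by LoadFromFile and SaveToFile without an explicit parameter is the system ANSI code page rather than strict ASCII, so passing a TEncoding explicitly is the unambiguous route.

```pascal
program MigrationDemo;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  Lines: TStringList;
  C: Char;
  as1: AnsiString;
  s2: string;
begin
  // Q6: CharInSet replaces "char1 in ['a'..'z']" for Unicode Chars.
  C := 'q';
  if CharInSet(C, ['a'..'z']) then
    Writeln(C, ' - character is lowercase');

  // Q4/Q5: explicit casts convert and also silence the implicit-cast warnings.
  as1 := 'plain text';
  s2 := string(as1);
  as1 := AnsiString(s2);

  Lines := TStringList.Create;
  try
    // Q7: control character literals are written exactly as before.
    Lines.Text := 'line one' + #13#10 + #9 + 'line two';
    Writeln(Lines.Count); // 2

    // Q8/Q9: choose the file encoding explicitly when saving and loading.
    Lines.SaveToFile('Test1_ansi.txt', TEncoding.Default); // ANSI code page on Windows
    Lines.SaveToFile('Test1_utf8.txt', TEncoding.UTF8);
    Lines.LoadFromFile('Test1_utf8.txt', TEncoding.UTF8);
    Writeln(Lines.Count); // 2
  finally
    Lines.Free;
  end;
end.
```

Saving with an ANSI or ASCII encoding replaces any character the target code page cannot represent, so it is only lossless while the data really is within that code page.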
Re: [DUG] Upgrading to XE - Unicode strings questions
Just thought I would chime in that I'm really interested in the answers to these questions too (Unicode being something we are also a bit apprehensive of).

-----Original Message-----
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of John Bird
Sent: Tuesday, 23 November 2010 1:04 p.m.
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Thanks for the references, so I can answer most of the questions now. Here is what I understand so far; if anyone has anything to add, that will be useful!

Extra question: It looks like code like

    for i:=1 to length(string1) do
    begin
      DoSomethingWithOneChar(string1[i]);
    end;

cannot be used reliably. The problem is that length(string1) looks like it cannot be safely used: a unicode character may occupy 2 code units, and length(string1) highlights that there is a difference between the number of unicode characters in a string and the number of code units. Still figuring out what the best practice is here, as I have quite a lot of string routines. Should be OK as long as the unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007, any more?
Answer - It is a project option from what I have read? Yes, not portable if unicode.

Q3 – I do a lot of reading ascii data files, and writing back, using mainly TFilestream and stringlists. Does this in general mean I will need to use file variables declared as AnsiChar and AnsiString instead of Char and String? (I would prefer to use the standard VCL where possible.)

If I have variables

    as1:Ansistring;
    s2:string;

Q4 – if I do s2:=as1 does this convert ansistrings to unicode?
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring? (otherwise how do I do this?)
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q6 – I understand any code like

    char1:=string1[i];
    if char1 in ['a'..'z'] then
    begin
      message:=string1[i]+' - character is lowercase';
    end

will break, as ansi characters are ordinals below 256, so set comparisons like ['a'..'z'] or ['a','b','c'] can be used, whereas this set code cannot be used for unicode characters. What is the replacement?
Answer - There is the CharInSet call, and numerous extra housekeeping functions added in TCharacter.

Q7 – do literals like #13#10 still mean carriage return and linefeed? #9 means tab? If I have code like (logline, string1 and string2 are strings)

    logline:=FormatDateTime('dd-mmm- hh:nn:ss',now) + string1 + #13#10+#9 + string2;
    ShowMessage(logline);
    Button1.hint:=logline;
    writeln(f,logline);

these work in D5-D2007, ie a 2 line messagebox text, a 2 line hint, and 2 lines written to a log file. Is this still going to work? Do carriage returns/tabs/other control characters have to be defined differently, eg as constants?
Answer - not figured out yet - anyone else know?

Q8 – stringlist1.loadfromfile('Test1.txt'); what happens if this file is ascii text being read into a stringlist which holds unicode strings?
Answer - Default is ASCII text for loadfromfile and savetofile; use the overloaded routines for Unicode.

Q9 - stringlist1.savetofile('Test1.txt'); presumably this is no longer ascii text. How do I save and read a stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9, is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type? (I use stringlists a lot)
Answer - unicode string lists can save to ascii or unicode files, so TAnsiStringlist is not needed.

Q11 – do inifiles become unicode too?
Answer - looks like no? Not clear? Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly? Or can it only be used on Ansi text files? Anyone know this?

Q13 - It looks like most programmers' editors read and write ascii and unicode encodings. The one I use seems to distinguish between UTF-8 and unicode as well – what is the difference? Anyone know this?

John

___
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe
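For Q4, Q5, Q8 and Q9 above, here is a minimal sketch of what explicit conversions and explicit file encodings might look like, assuming Delphi 2009 or later on Windows. The helper name RoundTripThroughAnsiFile is made up for illustration, the file name is the one from the question, and TEncoding.Default is assumed to be the system ANSI code page.

```pascal
program AnsiRoundTrip;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: saves S as ANSI text, then reads it back.
function RoundTripThroughAnsiFile(const S, FileName: string): string;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add(S);
    // Q9: write the list as ANSI text (system default code page)
    Lines.SaveToFile(FileName, TEncoding.Default);
    Lines.Clear;
    // Q8: read it back, naming the expected encoding explicitly
    Lines.LoadFromFile(FileName, TEncoding.Default);
    Result := Lines[0];
  finally
    Lines.Free;
  end;
end;

var
  as1: AnsiString;
  s2: string; // UnicodeString in Delphi 2009 and later
begin
  as1 := 'plain text';
  s2 := string(as1);      // Q4: ANSI -> Unicode, explicit cast
  as1 := AnsiString(s2);  // Q5: Unicode -> ANSI, explicit cast, may lose characters
  Writeln(RoundTripThroughAnsiFile(s2, 'Test1.txt'));
end.
```

Passing TEncoding.UTF8 or TEncoding.Unicode instead should give a UTF-8 or UTF-16 file. Characters that do not exist in the ANSI code page would typically not survive the ANSI round trip.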
Re: [DUG] Upgrading to XE - Unicode strings questions
that definition will suffice). UTF8/16/32 are different *encodings* for that character set. For UTF16 and UTF32 there are also big- and little-endian variants.

As noted before, in Notepad, and possibly in other apps, the term “Unicode” denotes “UTF16”. UTF32 is rarely encountered in the wild, which might explain why there is no TEncoding support for it (and indeed why Notepad doesn’t appear to support it).

As far as the difference between ASCII and UTF8 encoded Unicode goes:

An ASCII file can represent only characters 0..127, and each character is certain to occupy a single byte.

A UTF8 file can represent *EVERY* Unicode character, not just ASCII, but characters with codepoints > 127 will occupy 2 or more bytes.

You may have spotted that for a file containing only ASCII characters, ASCII and UTF8 encoding are physically indistinguishable at the character data level. However, a *true* UTF8 file (as opposed to an ASCII file that could be treated naively as UTF8 – or vice versa) will usually have a BOM (Byte Order Mark), although strictly speaking the BOM is optional for UTF8. A BOM is a sequence of bytes that is prepended to a file (or stream) to indicate the Unicode encoding and to identify the byte order for those encodings that have big/little endian variants.

I hope that all helps a little. :-)
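The ASCII versus UTF8 byte counts and the UTF8 BOM described above can be checked with a short sketch like the following, assuming Delphi 2009 or later. The helper names Utf8Length and HasUtf8Bom are invented for this example.

```pascal
program Utf8Bytes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of bytes S occupies when encoded as UTF-8.
function Utf8Length(const S: string): Integer;
begin
  Result := Length(TEncoding.UTF8.GetBytes(S));
end;

// Hypothetical helper: True if Bytes starts with the UTF-8 BOM (EF BB BF).
function HasUtf8Bom(const Bytes: TBytes): Boolean;
begin
  Result := (Length(Bytes) >= 3) and
            (Bytes[0] = $EF) and (Bytes[1] = $BB) and (Bytes[2] = $BF);
end;

begin
  // 'A' is U+0041: one byte in UTF-8, the same byte as in ASCII
  Writeln(Utf8Length('A'));
  // U+00E9 (e acute) is above 127: two bytes in UTF-8
  Writeln(Utf8Length(#$00E9));
  // The preamble TEncoding.UTF8 reports is the three-byte BOM
  Writeln(HasUtf8Bom(TEncoding.UTF8.GetPreamble));
end.
```

The same HasUtf8Bom check could be applied to the first bytes read from a file to tell a BOM-marked UTF8 file from a plain ASCII or ANSI one.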
Re: [DUG] Upgrading to XE - Unicode strings questions
You beat me to it. I was going to say the same, that I'm interested in these answers also. I have customers all over the world and just recently the display of Chinese characters was desired in a non-Chinese speaking country. So eventually I'll have to convert to Unicode.

Ross.

-----Original Message-----
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of David Brennan
Sent: Tuesday, 23 November 2010 1:27 PM
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Just thought I would chime in that I'm really interested in the answers to these questions too (Unicode being something we are also a bit apprehensive of).
Re: [DUG] Upgrading to XE - Unicode strings questions
You should get a copy of Marco Cantu's Delphi 2009 Handbook - it has about 90 pages on Unicode in Delphi. I think Bob Swart has a similar (less detailed) book. There are also some videos from one of the CodeRage events (probably CodeRage 3 or 4).

Alister Christie
Computers for People
Ph: 04 471 1849 Fax: 04 471 1266
http://www.salespartner.co.nz
PO Box 13085 Johnsonville Wellington

On 18/11/2010 5:48 p.m., John Bird wrote:

Planning upgrading from D2007 to XE, but want to read up on issues I will need to consider first to do with strings becoming Unicode by default. I recall the release of D2009 came with good white papers explaining the ramifications; however I haven’t seen these as I haven’t upgraded. Asked for such also at the XE event but have not been sent anything yet.

I have a lot of code which I want to be able to recompile easily, and would like to plan this migration. I would prefer to put anything contentious or varying into a library unit, a ‘wrapper’, so that I don’t have to deal with these version differences in the main code...

If anyone can answer any of these quick questions, please post here or email me – thanks!

Q1 - Anyone got some good references to read up on ansistring to unicode issues? Comprehensive please!
Re: [DUG] Upgrading to XE - Unicode strings questions
I won't answer everything but just on this one question:

On 23 November 2010 11:04, John Bird johnkb...@paradise.net.nz wrote:

Extra question: It looks like code like

    for i:=1 to length(string1) do
    begin
      DoSomethingWithOneChar(string1[i]);
    end;

cannot be used reliably.

You can use something like this:

    var
      C: Char;
    ...
    for C in String1 do
    begin
      DoSomethingWithOneChar(C);
    end;

In this case you don't need to know the index of each character, you just get the char using the for..in..do loop.
Re: [DUG] Upgrading to XE - Unicode strings questions
Jolyon beat me to answering those questions, but here are my additional 2 cents:

Q1: Unicode strings treat each character as 2 bytes. Length returns the number of 2-byte Char elements, not the size of memory allocated. Each access to the string with array syntax returns a WideChar instead of an AnsiChar. Your DoSomethingWithOneChar procedure will be called with a WideChar as input, but that probably won't cause any problems, as WideChar is a superset of AnsiChar, so there won't be any issues when going in that direction.

Q8: stringlist.loadfromfile will auto-detect the encoding by looking for magic markers (BOM code 0xEF 0xBB 0xBF for UTF8 at the beginning of the file) and other things, like Unicode-codepoint encoding validity.

Q11: inifiles: yes, these files will now have support for Unicode too.

Q13: Unicode is synonymous with “character encoding of the universal character set”, so it actually consists of two parts: the character set (about 109,000 characters are officially defined) and the various encoding formats that are used to represent those characters (utf8/utf16/utf32/ucs2/ucs4/etc). Windows started with UCS-2 (in Windows NT) and then switched to UTF16. UCS-2 only allowed 65,536 code points, so Microsoft had to switch to UTF-16 in newer Windows versions to support the full character set. This means that some weird and/or no longer used characters from dead/historic languages can sometimes take up more than 2 bytes (the size of a WideChar) – this isn’t usually an issue when developing Unicode enabled applications … unless your software needs to handle and display things like “cuneiform script” perfectly.

Kind Regards,
Stefan Mueller
___
RD Manager
ORCL Toolbox LLP, Japan
http://www.orcl-toolbox.com

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Jolyon Smith
Sent: Tuesday, November 23, 2010 9:40 AM
To: 'NZ Borland Developers Group - Delphi List'
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

I'm guessing my response to your previous email didn't come thru for some reason - resending:

I shall address some of your questions that I can answer quickly:

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007 any more?

I forget precisely which version of the IDE introduced the change, but the IDE has for some time supported different encodings for source/DFM files. Certainly this was present in D2006 and it may even have been as far back as D7 or even earlier that it was introduced. (Right click in the source/dfm file and choose File Format from the context menu to see/change the file encoding.)

Q3 – I do a lot of reading ascii data files, and writing back. Using mainly TFilestream and stringlists.

With TFileStream you should be OK, as long as you read/write into ANSIString/ANSIChar buffers as you already surmised. With TStringList you are forced to push your data through a Unicode/ANSI conversion when reading/writing from/to ANSI files, since the TStringList itself holds UnicodeString items. You can do this using the new Encoding parameter to the relevant methods of the class to ensure you read/write the correct/expected encoding (reading should correctly detect the encoding, but when writing you will need to be explicit).

Q4 – if I do s2:=as1 does this convert ansistrings to unicode?
Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

Yes, but you will get a warning when going from Unicode to ANSI (since not all ANSI encodings will support the possible content of a Unicode string). To avoid this, be explicit with the conversion.

Q6 – I understand any code like

    char1:=string1[i];
    if char1 in ['a'..'z'] then
    begin
      message:=string1[i]+' - character is lowercase';
    end

will break.

Nope, it's fine. But again, you will get a warning, in this case that the WIDECHAR has been reduced to a BYTE (NOTE: not converted to ANSICHAR), and a suggestion that you use CharInSet() instead. Note however that CharInSet contains no real magic that makes sets work for >255 elements - it merely provides a wrapper around code that will avoid the warning and the suggestion that you use CharInSet(). You can achieve the same effect by again simply being explicit that you know that what you are doing is intended and safe, by reducing the WideChar to an ANSIChar yourself:

    if ANSICHAR(char1) in ['a'..'z'] then

To my mind this is preferable to using CharInSet() as it makes it clearer in the code what is going on (that non-ANSIChars are not expected and may not be handled as intended). Using CharInSet() won't make any material difference to the behaviour of the code, but it would make it less apparent what is going on (i.e. that your code deals specifically with ANSI chars).

CharInSet() performs a test
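To make the Q6 discussion concrete, here is a small sketch contrasting an ASCII-only set test with a Unicode-aware one. It assumes the Delphi 2009/XE-era Character unit and its TCharacter class (later Delphi versions mark these as deprecated in favour of helper methods on Char). The function names IsAsciiLower and IsUnicodeLower are made up for the example.

```pascal
program LowercaseTest;
{$APPTYPE CONSOLE}

uses
  SysUtils, Character;

// ASCII-only test, as in Q6: only 'a'..'z' count as lowercase.
function IsAsciiLower(C: Char): Boolean;
begin
  Result := CharInSet(C, ['a'..'z']);
end;

// Unicode-aware test using the Character unit.
function IsUnicodeLower(C: Char): Boolean;
begin
  Result := TCharacter.IsLower(C);
end;

begin
  Writeln(IsAsciiLower('q'));        // TRUE
  Writeln(IsAsciiLower(#$00E9));     // FALSE: e acute is not in 'a'..'z'
  Writeln(IsUnicodeLower(#$00E9));   // TRUE: e acute is a lowercase letter
  Writeln(IsUnicodeLower('Q'));      // FALSE
end.
```

Which of the two is appropriate depends on whether the code is really meant to handle only ASCII letters, which is the distinction the message above is drawing.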
Re: [DUG] Upgrading to XE - Unicode strings questions
Colin, the "for C in" loop and the "for i := 1 to Length()" loops are functionally identical! The only difference is that the "for in" version incurs the slight overhead of the enumerator framework invoked by the compiler and runtime magic to support that syntax.

But in neither case will the loop itself help detect/respond to surrogate pairs (a single "WideChar" is potentially only ½ the data required to form a complete "*character*").

The only way to reduce an iterator over a string to a simple char-wise loop, whether explicit or using enumerators, is to first convert to UTF32, the facilities for which in the Delphi RTL are *cough* rudimentary, to put it politely. Non-existent may be nearer the mark. The precise mechanics of the loop construct used are not material to that problem.

However, just as before Unicode, when most people didn't care and just wrote code that assumed ANSI==ASCII, these days people won't care and will write code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs just as they used to ignore extended ASCII and ANSI characters. And for most people, that will probably actually work.

J

From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Colin Johnsun
Sent: Tuesday, 23 November 2010 14:31
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions
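The surrogate pair point can be illustrated with a hand-rolled loop that steps over a string one code point at a time. This is only a sketch: CountCodePoints is a hypothetical helper, it assumes 1-based UnicodeString indexing as in the desktop Delphi 2009/XE compilers, and it does nothing about composite (combining) characters.

```pascal
program CodePoints;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: counts code points in S, treating a high surrogate
// ($D800..$DBFF) followed by a low surrogate ($DC00..$DFFF) as one.
function CountCodePoints(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and
       (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)   // surrogate pair: one code point stored in two Chars
    else
      Inc(I);     // BMP character (or a stray surrogate): one Char
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // 'ab' followed by U+1D11E (musical G clef), which needs a surrogate pair
  S := 'ab' + #$D834#$DD1E;
  Writeln('Length (code units): ', Length(S));           // 4
  Writeln('Code points:         ', CountCodePoints(S));  // 3
end.
```

A routine that has to process whole characters could use the same test to decide whether to hand one or two Chars to its per-character code.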
Re: [DUG] Upgrading to XE - Unicode strings questions
Doh! Thanks Jolyon for clearing that misunderstanding on my part.

I was aware of the surrogate pair issue but I wrongly assumed that this might have been taken care by the iterator implementation. I guess not.

Thanks again!

Cheers,
Colin

On 23 November 2010 13:06, Jolyon Smith jsm...@deltics.co.nz wrote:

Colin, the for C in loop and the for i := 1 to Length() loops are functionally identical!
Re: [DUG] Upgrading to XE - Unicode strings questions
My main remaining question is the best way to handle code that up to now looked like:

    for i:=1 to length(string1) do
    begin
      DoSomethingWithOneChar(string1[i]);
    end;

If I got the gist correctly, string1[i] is one Char (a single code unit, which is usually but not always a whole unicode character), and length(string1) is the number of code units in the string and not necessarily the number of characters. This is gonna be confusing!

Other comments:

Comment 1 - I saw quite a few commentators say that they in general approved of the way that unicode had been implemented: everything that was ansi string before is now unicode consistently throughout the whole language and IDE, and in the main the only code that needs altering is where Delphi is communicating outside the standard language, ie:

- DLL calls
- SaveToFile and LoadFromFile and other file access - even here smart defaults have been put in to retain expected behaviour.
- Sending strings to COM/TCP etc - you might need to convert to get the kind expected.
- Database fields - usually handled by making sure the right encoding is sent.

Comment 2 - The worst inconveniences are for those who have already tried to do some unicode type processing using WideChar, and the functions that were used for these. Undoing these changes is usually the best way to cater for unicode. Also some of the routines introduced then have horribly confusing names, like AnsiPos, which is for searching widechars and is still what should be used for searching. It seems to me that some identical routines should be introduced - eg called UnicodePos(.) - just so that those who are new to Unicode can use at least a consistently named set of tools. I would probably make routines named like this for my own use, just to be clear.

Comment 3 - I see a few people arguing that there should have been a compiler switch to allow compiling to ansistring or unicode string, to ease converting people to D2009/XE. There are merits either way on this - in the long term, if everyone is going to have to live in a unicode world, then it's probably better to bite the bullet and be made to convert code, as eventually you cannot escape it. In such a case a simpler compiler and VCL is a big advantage. This is sort of related to being able to cross compile to 64 bit, iPhone, Android - whatever way makes it easy to have these forward looking options. The quite stark reality is that in 5 years it looks like much but not all commercial software will be running on Windows; it's likely to be a mix of Web/iPhone/Android/GoogleOS/MacOS, so the forwards portability of compiling Delphi for different environments is way more important than whether it should be able to do Strings as AnsiString.

Comment 4 - Has anyone at Embarcadero considered 2 ways to make Delphi cross platform? Option A is to go for a native compiler for different OS's - best if it can be done. Option B is the Java route - compile to intermediate code for a Delphi Virtual Machine which can run interpreted with a runtime on many OS's. Could be called the Delphi Virtual VCL Machine. The reason why this might be a good way to go is that Pascal, which Delphi is based on, was originally designed as a teaching language - ie a formally very strongly typed and well structured language - so it could be about the best candidate around for generalised compiling and a simple cross platform runtime. Also, with Java now owned by Oracle, there are questions over whether it has such a bright future, and there is room for another similar approach. DotNet is a similar idea too, but will only ever really be Windows. With a Delphi Virtual Machine it might not matter too much if it's slower, if it's portable. [But I digress - the last point is way off topic for Unicode]

Comment and question 5 - What is the status of Free Pascal/Lazarus wrt unicode? Does Delphi XE code port to Free Pascal or not? It's an issue to consider as well.
Re: [DUG] Upgrading to XE - Unicode strings questions
@ Colin : No worries. @All : One other thing to point out is that when working with genuine, actual Unicode strings you should be careful to use the correct ANSI() functions... yes, you read that right. S := Uppercase(S); Will NOT convert Unicode characters (just as it would previously have not converted non-ASCII characters). S := ANSIUppercase( S ); On the other hand will. The same goes for the likes of SameText() vs ANSISameText() etc. If you were writing for extended character sets in the past you were most likely already using these routines, but if you werent (perhaps because Delphi doesnt support extended chars very well) and are now thinking that by simply upgrading to a Unicode Delphi all such things are magically taken care of, you will be in for a shock. Better yet, use the routines introduced in the Character unit (why not UnicodeUtils? DOH!) The only problem you then have is if you want to write string handling/manipulating code that will be portable between Unicode and non-Unicode Delphi compilers. From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Colin Johnsun Sent: Tuesday, 23 November 2010 15:22 To: NZ Borland Developers Group - Delphi List Subject: Re: [DUG] Upgrading to XE - Unicode strings questions Doh! Thanks Jolyon for clearing that misunderstanding on my part. I was aware of the surrogate pair issue but I wrongly assumed that this might have been taken care by the iterator implementation. I guess not. Thanks again! Cheers, Colin On 23 November 2010 13:06, Jolyon Smith jsm...@deltics.co.nz wrote: Colin, the for C in loop and the for i := 1 to Length() loops are functionally identical! The only difference is that the for in version incurs the slight overhead of the enumerator framework invoked by the compiler and runtime magic to support that syntax. 
But in neither case will the loop itself help detect/respond to surrogate pairs (a single WideChar is potentially only ½ the data required to form a complete character). The only way to reduce an iterator over a string to a simple char-wise loop, whether explicit or using enumerators, is to first convert to UTF32, the facilities for which in the Delphi RTL are cough rudimentary, to put it politely. Non-existent may be nearer the mark. The precise mechanics of the loop construct used is not material to that problem. However, just as before Unicode when most people didnt care and just wrote code that assumed ANSI==ASCII, these days people wont care and will write code that assumes that Unicode==BMP (Basic Multilingual Plane), ignoring surrogate pairs just as they used to ignore extended ASCII and ANSI characters. And for most people, that will probably actually work. J From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of Colin Johnsun Sent: Tuesday, 23 November 2010 14:31 To: NZ Borland Developers Group - Delphi List Subject: Re: [DUG] Upgrading to XE - Unicode strings questions I won't answer everything but just on this one question: On 23 November 2010 11:04, John Bird johnkb...@paradise.net.nz wrote: Extra question: It looks like code like for i:=1 to length(string1) do begin DoSomethingWithOneChar(string1[i]); end; cannot be used reliably. The problems are that length(string1) looks like it cannot be safely used - as unicode characters may include 2 codepoints and length(string1) highlights that there is a difference between the number of unicode characters in a string and the number of codepoints. Still figuring out what is the best practice here, as I have quite a lot of string routines. Should be be OK as long as the unicode text actually is ASCII. you can use something like this: var C: Char; ... 
for C in String1 do
begin
  DoSomethingWithOneChar(C);
end;

In this case you don't need to know the index of each character, you just get the char using the for..in..do loop.

___
NZ Borland Developers Group - Delphi mailing list
Post: delphi@delphi.org.nz
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to delphi-requ...@delphi.org.nz with Subject: unsubscribe
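To make the UpperCase vs AnsiUpperCase point in this message concrete, here is a minimal console sketch. It assumes a Unicode Delphi (2009 or later) on Windows; the program name and the test string are made up for illustration. The accented character is written as a 4-digit #$ literal so the result does not depend on how the source file is encoded.

```delphi
program UpperDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
begin
  S := 'caf' + #$00E9; // "cafe" with e acute stored as U+00E9

  // UpperCase only maps 'a'..'z', so U+00E9 should come back unchanged
  Writeln('UpperCase leaves U+00E9 alone: ', UpperCase(S) = 'CAF' + #$00E9);

  // AnsiUpperCase is locale-aware, so U+00E9 should become U+00C9
  Writeln('AnsiUpperCase gives U+00C9:    ', AnsiUpperCase(S) = 'CAF' + #$00C9);

  // The same split applies to case-insensitive comparison
  Writeln('SameText matches:     ', SameText(S, 'CAF' + #$00C9));
  Writeln('AnsiSameText matches: ', AnsiSameText(S, 'CAF' + #$00C9));
end.
```

If the claims in the message hold, the two Ansi routines report a match and SameText does not, which is a quick way to check the behaviour on your own compiler version.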
Re: [DUG] Upgrading to XE - Unicode strings questions
As I understand it, iterating over a string with Chars does get around the problem of surrogate pairs, as any character you are currently on might be either 1, 2 or more bytes if it contains surrogate pairs, but just one unicode character. So if one is after iterating over the characters in the string your code should be perfect.

My question is: if you are not using

for C in String1 do

and want to use

for i := 1 to length(string1) do

what do you use instead of length to get the number of characters in the string in general? length is not the number of characters, it's the number of code-points (including surrogate pairs counted separately) if I understand correctly.

Separate issue - I understand that if one wants to iterate over the bytes of a string then one uses byte rather than char, and then one does have to investigate each byte to see if it is part of a surrogate pair. There look to be routines for this – however I am guessing most won't be needing to do this. Fortunately!

Also – I think getting what we used to call the ASCII value of a character, or creating a character, still works the same - in fact for the English alphabet the codes are the same, I understand? Can someone confirm? (i.e. the character might use 2 bytes if encoded as a unicode string, but the value stored for 'A' is still 41 hex or 65 decimal.) Which means I think that one can do

var
  code1, code2: integer;
  char1: ansichar;
  char2: char;
...
char1 := 'A';
char2 := 'A'; // unicode char 2 bytes
code1 := ord(char1);
code2 := ord(char2);

In this case I think code1 = code2?? Can anyone confirm this? Of course once one goes away from English/latin 8859 characters this is no longer going to be true.

John
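The Ord question above is easy to check directly. A minimal sketch, assuming a Unicode Delphi where Char is a 2-byte WideChar; the program name is hypothetical. Unicode keeps the ASCII values for code points 0..127, so only the storage size should differ.

```delphi
program OrdDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

var
  char1: AnsiChar;
  char2: Char; // WideChar in Delphi 2009 and later
begin
  char1 := 'A';
  char2 := 'A';

  // Same ordinal value, different storage size
  Writeln(Ord(char1), ' ', SizeOf(char1)); // 65 1
  Writeln(Ord(char2), ' ', SizeOf(char2)); // 65 2
  Writeln(Ord(char1) = Ord(char2));        // TRUE
end.
```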
Re: [DUG] Upgrading to XE - Unicode strings questions
No on two counts:

String[1] is one WIDE character, which may or may not be a complete Unicode codePOINT (and so equally may or may not be a complete Unicode character, although the definition of what constitutes a character in Unicode is a whole separate topic).

Length( s ) will always yield the number of chars in s. The only wrinkle that Unicode introduces here is that the number of chars no longer == the number of *bytes* (each char is a WIDEChar and therefore 2 bytes). But you can still reliably index each WIDEChar in a WIDEString using the [nth] element index.

Strings in COM have always been WideString - conversion to/from UnicodeString is automatic and lossless (in terms of data).

TCP, yes you will have to do work to support Unicode in this area if you haven't already done so (but the internet has - if not entirely, then in large part - been Unicode for a long time now, so you really should have taken care of this already, regardless of the Unicode-ness or otherwise of your Delphi code itself). But that applies to ANY external systems with which your code interacts that may already be Unicode (or indeed which will remain resolutely ANSI, even if your app becomes Unicode).

In addition to inconveniences for people who had already done some work to support Unicode, the implementation does little/nothing to encourage or promote *correct* Unicode support in new projects and introduces potential for confusion and mistakes in many areas imho.

The entire string handling area of the RTL should have been thrown out and a properly thought out framework introduced to replace it, and yes, we should have been forced to migrate to the new, consistent and comprehensive string RTL (or at least encouraged, by marking all existing RTL support as deprecated).
PLUS, for the backwards compatibility crowd, they *could* have supported a String == Unicode compiler switch imho (not just an "I wish they had" - I can see technically precisely HOW it would and could have been implemented, and it fits perfectly with their own advice for how to deal with code that is problematic to convert to Unicode).

Whilst at a technical level this may not have been a huge advantage, it certainly would have been a welcome comfort to people facing the job of converting large applications with libraries of - in some cases no longer supported - 3rd party library code, by enabling them to flag those units as ANSI and deal with the conversion warnings that would have subsequently been emitted by linking with the Unicode VCL.

The only real argument against a compiler switch comes from the view that having two versions of the VCL - one Unicode and one ANSI - would have been required and would have been unworkable. This is not the case IMHO. The VCL could have gone unilaterally and fixedly String==UnicodeString whilst allowing us to compile our own units with String==ANSI/UnicodeString.

As I say, the technique of enforcing ANSI-ness in unsafe Unicode units in order to defer the job of migrating those units to Unicode is well documented and is the official advice in such difficult cases. A compiler switch as I envisage it would simply have made that process more straightforward - the net effect would have been the same, which on its own demonstrates that such a switch was in fact technically possible IF IMPLEMENTED IN THAT WAY, despite the protestations to the contrary (which assume a DIFFERENT implementation approach).

Too late now of course.
:)

-Original Message-
From: delphi-boun...@delphi.org.nz [mailto:delphi-boun...@delphi.org.nz] On Behalf Of John Bird
Sent: Tuesday, 23 November 2010 15:36
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

My main remaining question is the best way to handle code that up to now looked like:

for i := 1 to length(string1) do
begin
  DoSomethingWithOneChar(string1[i]);
end;

If I got the gist correctly, string1[i] is one unicode character, but length(string1) is the number of codepoints in the string and not the number of characters. This is gonna be confusing!

Other comments:

Comment 1 - I saw quite a few commentators say that they in general approved of the way that the unicode had been implemented - everything that was ansi string before is now unicode consistently throughout the whole language and IDE, and in the main the only code that needs altering is where Delphi is communicating outside the standard language, ie:

-DLL calls
-SavetoFile and LoadFromFile and other file access - even here smart defaults have been put in to retain expected behaviour.
-Sending strings to COM/TCP etc - you might need to convert to get the kind expected
-Database fields - usually handled by making sure the right encoding is sent.

Comment 2 - The worst inconveniences are for those who have already tried to do some unicode type processing using WideChar
Re: [DUG] Upgrading to XE - Unicode strings questions
Hi John

Extra question: It looks like code like

for i := 1 to length(string1) do
begin
  DoSomethingWithOneChar(string1[i]);
end;

cannot be used reliably.

I think the solution here is not to concentrate so much on unicode, but rather on what DoSomethingWithOneChar() is trying to achieve. Does the function even make sense for non-ANSI characters?

Todd.

The problems are that length(string1) looks like it cannot be safely used - as unicode characters may include 2 codepoints and length(string1) highlights that there is a difference between the number of unicode characters in a string and the number of codepoints. Still figuring out what is the best practice here, as I have quite a lot of string routines. Should be OK as long as the unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007 any more?
Answer - It is a project option from what I have read? Yes, not portable if unicode.

Q3 – I do a lot of reading ascii data files, and writing back. Using mainly TFilestream and stringlists. Does this in general mean I will need to use file variables declared as Ansichar and AnsiString instead of Char and String? (I would prefer to use the standard VCL where possible)

If I have variables

as1: Ansistring;
s2: string;

Q4 – if I do s2 := as1 does this convert ansistrings to unicode?
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q5 – if I do as1 := s2 does this convert a unicode string to ansistring? (otherwise how do I do this?)
Answer - yes, there are performance issues to watch out for if conversion happens a lot.

Q6 – I understand any code like

char1 := string1[i];
if char1 in ['a'..'z'] then
begin
  message := string1[i] + ' - character is lowercase';
end

will break, as ansi characters are ordinal (less than 256 or 512) and set comparisons ['a'..'z'] or ['a','b','c'] can be used, but this set code cannot be used for unicode characters. What is the replacement?
Answer - There is the CharInSet call and numerous extra housekeeping functions added in TCharacter.

Q7 – do literals like #13#10 still mean carriage return and linefeed? #9 means tab? If I have code like (logline, string1, string2 are string)

logline := FormatDateTime('dd-mmm- hh:nn:ss', now) + string1 + #13#10 + #9 + string2;
ShowMessage(logline);
Button1.hint := logline;
writeln(f, logline);

these work in D5-D2007 - ie a 2 line messagebox text, 2 line hint, and 2 lines written to a log file. Is this still going to work? Do carriage returns/tabs/other control characters have to be defined differently, eg as constants?
Answer - not figured out yet - anyone else know?

Q8 – stringlist1.loadfromfile('Test1.txt'); what happens if this file is ascii text being read into a stringlist which is unicode strings?
Answer - Default is Ascii text for loadfromfile and savetofile, use overloaded routines for Unicode.

Q9 - stringlist1.savetofile('Test1.txt') presumably this is no longer ascii text. How do I save and read a stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type? (I use stringlists a lot)
Answer - unicodestring lists can save to ascii or unicode files, so TAnsiStringlist not needed.

Q11 – do inifiles become unicode too?
Answer - looks like no? Not clear? Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly? or can it only be used on Ansi text files? Anyone know this?

Q13 - It looks like most programmers editors read and write ascii and unicode encoding. The one I use seems to distinguish between UTF-8 and unicode as well – what is the difference? Anyone know this?
John
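For Q6, the CharInSet answer can be sketched as below. This assumes a Unicode Delphi (2009 or later) with 1-based string indexing; the program name and sample string are made up. As far as I know the old "in" form still compiles there, but with a warning about a WideChar being reduced to a byte char, so CharInSet is the explicit replacement. It only covers characters that fit in a set of AnsiChar; for letters beyond that range the Character unit routines mentioned in the answer are the place to look.

```delphi
program CharInSetDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  string1, msg: string;
  i: Integer;
begin
  string1 := 'aBc';
  for i := 1 to Length(string1) do
  begin
    // Replaces: if string1[i] in ['a'..'z'] then
    if CharInSet(string1[i], ['a'..'z']) then
    begin
      msg := string1[i] + ' - character is lowercase';
      Writeln(msg); // printed for 'a' and 'c', not for 'B'
    end;
  end;
end.
```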
Re: [DUG] Upgrading to XE - Unicode strings questions
As I understand it iterating over a string with Chars does get around the problem of surrogate pairs

It depends what you mean by "get around the problem".

for c in s do
  WorkWith( c );

will iterate once for each c (WIDECHAR) in s. Some of those c's may be in surrogate pairs, but you will get only one half of each pair at a time. So if your WorkWith() routine simply ignores surrogate pairs then yes, you got around the problem. But if WorkWith() needs to work on discrete codepoints beyond the BMP then you have some extra work to do before you can call WorkWith(), and you must call it with a UTF32 parameter, NOT a UTF16 WideChar (unless WorkWith() has some way of keeping track of calls made to it, and doing the job of combining surrogates for itself - which is unlikely I think).

But crucially, "for c in s" is absolutely no different from:

for i := 1 to Length(s) do
  WorkWith( s[i] );

They do exactly the same thing - namely iterate over each widechar in the string.

as any character you are currently on might be either 1,2 or more bytes if it contains surrogate pairs, but just one unicode character

This makes no sense. *Every* character (WIDECHAR) that you are on will be 2 bytes. No more. No less. The number of the bytes shall be 2, and 2 shall be the number. What those 2 bytes represent may be either a complete Unicode codepoint (in the BMP) or one of the hi/lo chars in a surrogate pair, which must be combined to derive the codepoint they represent.

what do you use instead of length to get the number of characters in the string in general?

Length(s) returns the number of WIDEChars. The number of n for which s[n] is valid.

length is not the number of characters, it's the number of code-points (including surrogate pairs counted separately) if I understand correctly.

Nope - you understand incorrectly.
:)

Separate issue - I understand that if one wants to iterate over the bytes of a string then one uses byte rather than char, and then one does have to investigate each byte to see if it is part of a surrogate pair.

No, this is what you have to do with WideChars in a string. You use bytes if you don't care about the characters at all and simply want to work with the raw byte data. Unlikely in the context of the questions you are asking here, I would add.
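The "extra work" of combining surrogates described in this message can be sketched as follows. This is only an illustration under stated assumptions: a Unicode Delphi with 1-based string indexing, and a helper name (CodePointCount) that is made up here rather than taken from the RTL. It checks the surrogate ranges by hand (high $D800..$DBFF, low $DC00..$DFFF) instead of relying on any particular RTL routine, and it counts code points only; combining sequences such as e followed by a combining acute are still counted as two.

```delphi
program CodePointDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

// Hypothetical helper: counts Unicode code points in a UTF-16 string
// by treating a valid high/low surrogate pair as a single code point.
function CodePointCount(const S: string): Integer;
var
  I: Integer;
begin
  Result := 0;
  I := 1;
  while I <= Length(S) do
  begin
    if (Ord(S[I]) >= $D800) and (Ord(S[I]) <= $DBFF) and (I < Length(S)) and
       (Ord(S[I + 1]) >= $DC00) and (Ord(S[I + 1]) <= $DFFF) then
      Inc(I, 2)  // high + low surrogate: two WideChars, one code point
    else
      Inc(I);    // BMP character (or an unpaired surrogate)
    Inc(Result);
  end;
end;

var
  S: string;
begin
  // 'A', then U+1D11E (musical G clef) as a surrogate pair, then 'B'
  S := 'A' + #$D834#$DD1E + 'B';
  Writeln(Length(S));         // 4 WideChars
  Writeln(CodePointCount(S)); // 3 code points
end.
```

The same pair test is what a WorkWith() caller would need before handing a full UTF-32 value on, which is why neither loop form solves the problem by itself.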
Re: [DUG] Upgrading to XE - Unicode strings questions
Hi John

Extra question: It looks like code like

for i := 1 to length(string1) do
begin
  DoSomethingWithOneChar(string1[i]);
end;

cannot be used reliably.

I think the solution here is not to concentrate on unicode vs widechar vs ansichar, but rather on what DoSomethingWithOneChar() is actually trying to achieve. Does the function even make sense for non-ANSI characters? Only a more concrete example can be discussed with meaning.

Todd.
Re: [DUG] Upgrading to XE - Unicode strings questions
Length( s ) will always yield the number of chars in s.

So how does one obtain the number of bytes in a string if one wants to use AnsiChar to check every character? Does s[0] work?

Ross.
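One possible answer to the byte-count question, as a sketch assuming a Unicode Delphi (2009 or later); the program name and sample string are made up. As far as I know s[0] only applies to ShortString, so for the default string type the byte count comes from Length multiplied by SizeOf(Char), or from SysUtils.ByteLength. To inspect AnsiChars, an explicit conversion to AnsiString is one option, with the caveat that characters outside the active ANSI code page do not survive it.

```delphi
program ByteLenDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils;

var
  S: string;
  A: AnsiString;
  I: Integer;
begin
  S := 'Delphi';
  Writeln(Length(S));                // 6  (WideChar elements)
  Writeln(Length(S) * SizeOf(Char)); // 12 (bytes of character data)
  Writeln(ByteLength(S));            // 12 (same figure via SysUtils)

  // Explicit, potentially lossy conversion for AnsiChar-by-AnsiChar checks
  A := AnsiString(S);
  for I := 1 to Length(A) do
    Write(Ord(A[I]), ' ');
  Writeln;
end.
```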
[DUG] Upgrading to XE - Unicode strings questions
Planning upgrading from D2007 to XE, but want to read up on issues I will need to consider first to do with strings becoming Unicode by default. I recall the release of D2009 came with good white papers explaining ramifications, however I haven't seen these as I haven't upgraded. Asked for such also at the XE event but have not been sent anything yet.

I have a lot of code which I want to plan to be able to recompile easily, and would like to plan this migration. I would prefer to put anything contentious or varying into a library unit, a 'wrapper', so that I don't have to deal with these version differences in the main code...

If anyone can answer any of these quick questions please post here or email me – thanks!

Q1 - Anyone got some good references to read up on ansistring to unicode issues? Comprehensive please!

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot be read by earlier Delphi, eg D2007 any more?

Q3 – I do a lot of reading ascii data files, and writing back. Using mainly TFilestream and stringlists. Does this in general mean I will need to use file variables declared as Ansichar and AnsiString instead of Char and String? (I would prefer to use the standard VCL where possible)

If I have variables

as1: Ansistring;
s2: string;

Q4 – if I do s2 := as1 does this convert ansistrings to unicode?

Q5 – if I do as1 := s2 does this convert a unicode string to ansistring? (otherwise how do I do this?)

Q6 – I understand any code like

char1 := string1[i];
if char1 in ['a'..'z'] then
begin
  message := string1[i] + ' - character is lowercase';
end

will break, as ansi characters are ordinal (less than 256 or 512) and set comparisons ['a'..'z'] or ['a','b','c'] can be used, but this set code cannot be used for unicode characters. What is the replacement?

Q7 – do literals like #13#10 still mean carriage return and linefeed? #9 means tab?
If I have code like (logline, string1, string2 are string)

logline := FormatDateTime('dd-mmm- hh:nn:ss', now) + string1 + #13#10 + #9 + string2;
ShowMessage(logline);
Button1.hint := logline;
writeln(f, logline);

these work in D5-D2007 - ie a 2 line messagebox text, 2 line hint, and 2 lines written to a log file. Is this still going to work? Do carriage returns/tabs/other control characters have to be defined differently, eg as constants?

Q8 – stringlist1.loadfromfile('Test1.txt'); what happens if this file is ascii text being read into a stringlist which is unicode strings?

Q9 - stringlist1.savetofile('Test1.txt') presumably this is no longer ascii text. How do I save and read a stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist type (for ansistrings) as well as a unicode TStringlist type? (I use stringlists a lot)

Q11 – do inifiles become unicode too?

Q12 – does Windows Notepad open unicode text files correctly? or can it only be used on Ansi text files?

Q13 - It looks like most programmers editors read and write ascii and unicode encoding. The one I use seems to distinguish between UTF-8 and unicode as well – what is the difference?

John
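For Q7 to Q9, a minimal sketch of what the code might look like under a Unicode Delphi (2009 or later) on Windows. It assumes the TStrings overloads that take a TEncoding; the program name and the second file name (Test1_utf8.txt) are made up for illustration. Control character literals such as #13#10 and #9 are still written the same way; only the in-memory Char size changes.

```delphi
program EncodingDemo; // hypothetical name, for illustration only
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

var
  stringlist1: TStringList;
  logline: string;
begin
  // #13#10 and #9 are still CR/LF and tab in a Unicode string
  logline := 'first line' + #13#10 + #9 + 'second line';
  Writeln(logline);

  stringlist1 := TStringList.Create;
  try
    stringlist1.Add('plain text');

    // Ask for the system ANSI code page explicitly when the file must stay ANSI
    stringlist1.SaveToFile('Test1.txt', TEncoding.Default);

    // Or pick a Unicode encoding when the text may not fit the ANSI code page
    stringlist1.SaveToFile('Test1_utf8.txt', TEncoding.UTF8);

    // Reading back with a stated encoding avoids guessing on load
    stringlist1.LoadFromFile('Test1.txt', TEncoding.Default);
    Writeln(stringlist1.Count);
  finally
    stringlist1.Free;
  end;
end.
```

Passing the encoding explicitly on both save and load keeps the file format a visible decision in the code, which fits the plan of isolating version differences in a wrapper unit.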