Re: [DUG] Upgrading to XE - Unicode strings questions

Jolyon Smith Mon, 22 Nov 2010 16:44:30 -0800

I'm guessing my response to your previous email didn't come thru for some 
reason - resending:



I shall address some of your questions that I can answer quickly:


Q2 – With XE do the .pas and .dfm files become unicode text and hence 
      cannot be read by earlier Delphi, eg D2007 any more?

I forget precisely which version of the IDE introduced the change, but the IDE 
has for some time supported different encodings for source/DFM files.  
Certainly this was present in D2006 and it may even have been as far back as D7 
or even earlier that it was introduced.

(Right click in source/dfm file and choose "File Format" from the context menu 
to see/change the file encoding)



Q3 – I do a lot of reading ascii data files, and writing back.   Using 
mainly TFilestream and stringlists.

With TFileStream you should be OK, as long as you read/write into 
ANSIString/ANSIChar buffers, as you already surmised.
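
For illustration, a minimal sketch of that approach.  LoadAnsiFile and 
SaveAnsiFile are names I have made up for this example, not RTL routines; 
the point is only that the bytes go straight into/out of an AnsiString with 
no Unicode conversion involved:

```pascal
program ReadAnsiFile;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: reads the whole file as raw bytes into an AnsiString.
function LoadAnsiFile(const FileName: string): AnsiString;
var
  Stream: TFileStream;
  Size: Integer;
begin
  Stream := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    Size := Integer(Stream.Size);
    SetLength(Result, Size);
    if Size > 0 then
      Stream.ReadBuffer(Result[1], Size);
  finally
    Stream.Free;
  end;
end;

// Hypothetical helper: writes the AnsiString bytes unchanged to the file.
procedure SaveAnsiFile(const FileName: string; const Data: AnsiString);
var
  Stream: TFileStream;
begin
  Stream := TFileStream.Create(FileName, fmCreate);
  try
    if Length(Data) > 0 then
      Stream.WriteBuffer(Data[1], Length(Data));
  finally
    Stream.Free;
  end;
end;

var
  Data: AnsiString;
begin
  SaveAnsiFile('Test1.txt', 'hello'#13#10);
  Data := LoadAnsiFile('Test1.txt');
  Writeln(Length(Data));  // number of bytes read back
end.
```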

With TStringList you are forced to push your data through a Unicode/ANSI 
conversion when reading/writing from/to ANSI files, since the TStringList 
itself holds UnicodeString items.  You can do this using the new "Encoding" 
parameter to the relevant methods of the class to ensure you read/write the 
correct/expected encoding (reading should correctly detect the encoding, but 
when writing you will need to be explicit).
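
Something along these lines, as a rough sketch.  RoundTripLineCount is just 
an illustrative name, and I am assuming TEncoding.Default (the system ANSI 
code page on Windows) is the encoding you want for your data files:

```pascal
program StringListAnsi;
{$APPTYPE CONSOLE}

uses
  SysUtils, Classes;

// Hypothetical helper: writes two lines as ANSI, reads them back and
// returns the number of lines loaded.
function RoundTripLineCount(const FileName: string): Integer;
var
  Lines: TStringList;
begin
  Lines := TStringList.Create;
  try
    Lines.Add('first line');
    Lines.Add('second line');
    // Be explicit about the encoding when writing...
    Lines.SaveToFile(FileName, TEncoding.Default);
    Lines.Clear;
    // ...and, to leave nothing to detection, when reading as well.
    Lines.LoadFromFile(FileName, TEncoding.Default);
    Result := Lines.Count;
  finally
    Lines.Free;
  end;
end;

begin
  Writeln(RoundTripLineCount('Test1.txt'));
end.
```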


Q4 – if I do s2:=as1  does this convert ansistrings to unicode?

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

Yes, but you will get a warning when going from Unicode to ANSI (since not all 
ANSI encodings will support the possible content of a Unicode string).  To 
avoid this, be explicit with the conversion.
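
By "explicit" I mean a plain typecast, roughly like this (using the as1 and 
s2 names from your question; RoundTrip is only a made-up wrapper so the 
fragment is complete):

```pascal
program ExplicitCasts;
{$APPTYPE CONSOLE}

// Hypothetical wrapper: converts to ANSI and back using explicit casts.
function RoundTrip(const s2: string): string;
var
  as1: AnsiString;
begin
  as1 := AnsiString(s2);  // explicit Unicode -> ANSI conversion
  Result := string(as1);  // explicit ANSI -> Unicode conversion
end;

begin
  Writeln(RoundTrip('plain text'));
end.
```

The cast only tells the compiler the conversion is intended.  Characters 
that the ANSI code page cannot represent are still lost on the way.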



Q6 – I understand any code like

            char1:=string1[i];
            if char1 in ['a'..'z'] then
            begin
                    message:=string1[i]+' - character is lowercase';
            end

        will break.


Nope, it's fine.  But again, you will get a warning, in this case that the 
WIDECHAR has been reduced to a BYTE (NOTE: not converted to ANSICHAR) and a 
suggestion that you use CharInSet() instead.

Note however that CharInSet contains no real "magic" that makes sets work for > 
255 elements - it merely wraps the set test in a way that avoids the compiler 
warning.  You can achieve the same effect by being explicit that what you are 
doing is intended and safe, reducing the WideChar to an ANSIChar yourself:

  if ANSICHAR(char1) in ['a'..'z'] then

To my mind this is preferable to using CharInSet() as it makes it clearer in 
the code what is going on (that non-ANSIChars are not expected and may not be 
handled as intended).  Using CharInSet() won't make any material difference to 
the behaviour of the code, but it would make it less apparent what is going on 
(i.e. that your code deals specifically with ANSI chars).

CharInSet() performs a test for the Char being (C < #$0100), but if your code 
is dealing with ANSI chars packaged in Unicode strings then this test is 
redundant, and using CharInSet() hides the intent of your code - to deal 
specifically with ANSI.

That is just my preference however.  Ymmv.
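
To make the comparison concrete, here is a small sketch with both forms 
side by side.  The two function names are invented for the example.  For 
chars below #$0100 both should give the same answer; only chars at or 
above #$0100 are handled differently, and those are exactly the ones an 
ANSI-only routine is not expecting anyway:

```pascal
program LowercaseTest;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: explicit reduction of the WideChar to an AnsiChar.
function IsLowerViaCast(C: Char): Boolean;
begin
  Result := AnsiChar(C) in ['a'..'z'];
end;

// Hypothetical helper: the same test expressed through CharInSet().
function IsLowerViaCharInSet(C: Char): Boolean;
begin
  Result := CharInSet(C, ['a'..'z']);
end;

begin
  Writeln(IsLowerViaCast('q'), ' ', IsLowerViaCharInSet('q'));
  Writeln(IsLowerViaCast('Q'), ' ', IsLowerViaCharInSet('Q'));
end.
```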



Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9 
means tab?

Yes.  But one thing to be aware of is that #nnn won't necessarily yield an 
ANSIChar(nnn).



Q8 – stringlist1.loadfromfile(‘Test1.txt’);
        what happens if this file is ascii text being read into a stringlist 
which is unicode strings.

The stringlist will contain UnicodeStrings, converted from the ASCII file 
content that was loaded.


Q9 -   stringlist1.savetofile(‘Test1.txt’)
         presumably this is no longer ascii text.

It won't be ASCII (but technically it never was :)); it will be ANSI unless you 
specify a different encoding.



Q9a - How do I save and read a stringlist to/from a file if it is to be Ansi 
text?

As you would have done before.  It is only when you want to save to something 
other than ANSI that you have to supply the Encoding parameter, for example to 
save as UTF8:

  strings.SaveToFile(filename, TEncoding.UTF8);


Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist type 
(for ansistrings) as well as a unicode TStringlist type?

NOPE!   A shocking omission imho.


Q12 – does Windows Notepad open unicode text files correctly?

Yep – and it’s a handy tool for testing variations in encoding.  In Notepad, when 
you “Save As” you can choose the encoding: ANSI, UTF8, BE Unicode or LE Unicode 
(here “Unicode” = UTF16).

When you “Save As” a file that you previously opened, the default encoding 
selected will reflect the encoding of the file when it was opened.



Q13 - It looks like most programmers editors read and write ascii and unicode 
encoding.....the one I use seems to distinguish between UTF-8 and unicode as 
well – what is the difference?

UTF-8 *is* Unicode.

Unicode is a character set (technically it is more than that, but for the 
purposes of this explanation that definition will suffice).

UTF8/16/32 are different *encodings* for that character set.  For UTF16 and 
UTF32 there are also Big and Little Endian variants.

As noted before, in Notepad, and possibly in other apps, the term “Unicode” 
denotes “UTF16”.

UTF32 is rarely encountered in the wild, which might explain why there is no 
TEncoding support for it (and indeed why Notepad doesn’t appear to support it).


As far as the difference between ASCII and UTF8 encoded Unicode goes:

An ASCII file can represent only characters 0..127 and each character is 
certain to occupy a single byte.

A UTF8 file can represent *EVERY* Unicode character, not just ASCII, but 
characters with codepoints > 127 will occupy 2 or more bytes.
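
A quick way to see this for yourself, as a sketch.  Utf8Size is a name I 
made up; it just asks TEncoding.UTF8 how many bytes a string would need.  
The non-ASCII characters are written as 4-digit hex literals so the example 
does not depend on how the source file itself is encoded:

```pascal
program Utf8Sizes;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: number of bytes S occupies when encoded as UTF8.
function Utf8Size(const S: string): Integer;
begin
  Result := TEncoding.UTF8.GetByteCount(S);
end;

begin
  Writeln(Utf8Size('A'));     // U+0041, plain ASCII: 1 byte
  Writeln(Utf8Size(#$00E9));  // U+00E9, e with acute accent: 2 bytes
  Writeln(Utf8Size(#$20AC));  // U+20AC, euro sign: 3 bytes
end.
```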

You may have spotted that for purely ASCII content, ASCII and UTF8 encoding are 
physically indistinguishable at the character data level.  However, a UTF8 file 
written by most Windows tools (as opposed to an ASCII file that could be treated 
naively as UTF8 – or vice versa) will usually start with a BOM (Byte Order Mark), 
although strictly speaking the BOM is optional for UTF8.

A BOM is a sequence of bytes that is prepended to a file (or stream) to 
indicate the Unicode encoding and identify the byte order for those encodings 
that have big/little endian variants.
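
If you want to see the actual bytes, TEncoding will tell you what it uses 
as the BOM for each encoding via GetPreamble.  A minimal sketch (PreambleHex 
is an invented helper name):

```pascal
program ShowBoms;
{$APPTYPE CONSOLE}

uses
  SysUtils;

// Hypothetical helper: returns the encoding's BOM bytes as a hex string.
function PreambleHex(Encoding: TEncoding): string;
var
  Bom: TBytes;
  I: Integer;
begin
  Result := '';
  Bom := Encoding.GetPreamble;
  for I := 0 to High(Bom) do
    Result := Result + IntToHex(Bom[I], 2);
end;

begin
  Writeln('UTF8:     ', PreambleHex(TEncoding.UTF8));              // EFBBBF
  Writeln('UTF16 LE: ', PreambleHex(TEncoding.Unicode));           // FFFE
  Writeln('UTF16 BE: ', PreambleHex(TEncoding.BigEndianUnicode));  // FEFF
end.
```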



I hope that all helps a little.


:-)






-----Original Message-----
From: [email protected] [mailto:[email protected]] On 
Behalf Of John Bird
Sent: Tuesday, 23 November 2010 13:04
To: NZ Borland Developers Group - Delphi List
Subject: Re: [DUG] Upgrading to XE - Unicode strings questions

Thanks for the references, so I can answer most of the questions now. 
Here is what I understand so far, if anyone has anything to add this will be 
useful!

Extra question:

It looks like code like

    for i:=1 to length(string1) do
    begin
            DoSomethingWithOneChar(string1[i]);
    end;

cannot be used reliably.   The problem is that length(string1) looks like 
it cannot be safely used - a unicode character may take up 2 UTF-16 code 
units (a surrogate pair), so there is a difference between the number of 
unicode characters in a string and the number of elements that 
length(string1) counts.   Still figuring out what is the best practice here, 
as I have quite a lot of string routines.   Should be OK as long as the 
unicode text actually is ASCII.

Q2 – With XE do the .pas and .dfm files become unicode text and hence cannot
be read by earlier Delphi, eg D2007 any more?

Answer - It is a project option from what I have read?  And yes, the files 
are not portable if saved as unicode.

Q3 – I do a lot of reading ascii data files, and writing back.   Using
mainly TFilestream and stringlists.   Does this in general mean I will need
to use file variables declared as Ansichar and AnsiString instead of Char
and String?
(I would prefer to use the standard VCL where possible)

If I have variables
        as1:Ansistring;
        s2:string;

Q4 –         if I do s2:=as1  does this convert ansistrings to unicode?

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.

Q5 – if I do as1:=s2 does this convert a unicode string to ansistring?

    (otherwise how do I do this?)

Answer - yes, there are performance issues to watch out for if conversion 
happens a lot.

Q6 – I understand any code like

            char1:=string1[i];
            if char1 in ['a'..'z'] then
            begin
                    message:=string1[i]+' - character is lowercase';
            end

        will break, as ansi characters are ordinal (less than 256 or 512)
and set comparisons ['a'..'z']  or ['a','b','c']    can be used, this set
code cannot be used for unicode characters.   What is the replacement?

Answer - There is CharInSet call and numerous extra housekeeping functions 
added in TCharacter.

Q7 – do literals like  #13#10 still mean carriage return and linefeed?  #9
means tab?
        if I have code like (logline string1 string2 are string)

        logline:=FormatDateTime(‘dd-mmm-yyyy hh:nn:ss’,now) + string1 +
#13#10+#9 + string2;
        ShowMessage(logline);
        Button1.hint:=logline;
        writeln(f,logline);

        these work D5-D2007   - ie a 2 line messagebox text, 2 line hint,
and 2 lines written to a log file.
        is this still going to work?

        do carriage returns/tabs/other control characters have to be defined
differently, eg as constants?

Answer - not figured out yet - anyone else know?

Q8 – stringlist1.loadfromfile(‘Test1.txt’);
        what happens if this file is ascii text being read into a stringlist
which is unicode strings.

Answer - Default is Ascii text for loadfromfile and savetofile, use 
overloaded routines for Unicode

Q9 -   stringlist1.savetofile(‘Test1.txt’)
         presumably this is no longer ascii text.   How do I save and read a
stringlist to/from a file if it is to be Ansi text?

Q10 – If there are complexities in Q8 and Q9 is there a TAnsiStringlist
type (for ansistrings) as well as a unicode TStringlist type?
        (I use stringlists a lot)

Answer - unicodestring lists can save to ascii or unicode files, so 
TAnsiStringlist not needed.

Q11 – do inifiles become unicode too?

Answer - looks like no?  Not clear?  Anyone else know?

Q12 – does Windows Notepad open unicode text files correctly?   or can it
only be used on Ansi text files?

Anyone know this?

Q13 - It looks like most programmers editors read and write ascii and
unicode encoding.....the one I use seems to distinguish between UTF-8 and
unicode as well – what is the difference?

Anyone know this?

John

_______________________________________________
NZ Borland Developers Group - Delphi mailing list
Post: [email protected]
Admin: http://delphi.org.nz/mailman/listinfo/delphi
Unsubscribe: send an email to [email protected] with Subject: 
unsubscribe 
