On Sun, 15 Apr 2012 07:16:15 +0200 Martin Schreiber <mse00...@gmail.com> wrote:
> On Saturday 14 April 2012 22:36:02 Marcos Douglas wrote:
> >
> > Well, works if I change this line:
> > fname := 'c:\á b ç\á.txt';
> > to this:
> > fname := UTF8Decode('c:\á b ç\á.txt');
> >
> > And doesn't matter if fname is UnicodeString or string -- well, the
> > debug hint to 'UnicodeString' is more beautiful than 'string' because
> > the compiler translate.
> >
> Add {$codepage utf8} to the unit header or compile with -Fcutf8, this is the
> default setting in MSEide+MSEgui for automatic Unicode handling.
> Warning: most likely this setting will break Lazarus on FPC 2.6.0. I don't
> know if Lazarus is fully tested with cpstrnew and -Fcutf8 already.

cpstrnew is part of FPC 2.7.1, and Lazarus has been running fine with it for
two months. -Fcutf8 and {$codepage utf8} have existed for ages.

There are some traps with -Fcutf8 and {$codepage utf8}. They only work as
expected if the RTL DefaultSystemCodePage is CP_UTF8. Otherwise your strings
are converted by the compiler.

For example, under Linux the RTL default is CP_ACP, which defaults to
ISO_8859-1. The RTL does *not* read your environment language on its own, so
the default is ISO_8859-1. This means your UTF-8 string constants are
converted by the compiler.

Compile this with -Fcutf8 and run it on a Linux system with LANG set to utf-8:

program project1;
{$mode objfpc}{$H+}
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.

This results in

0 65001
ä

The LCL uses a widestringmanager (at the moment cwstring), which sets the
DefaultSystemCodePage. You can do the same in your non-LCL programs:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
begin
  writeln(DefaultSystemCodePage,' ',CP_UTF8);
  writeln('ä');
end.

This results in

65001 65001
ä

The above is a lie for kids.
See the program below:

program project1;
{$mode objfpc}{$H+}
uses cwstring;
var
  a,b,c: string;
begin
  writeln(ord(DefaultSystemCodePage),' ',CP_UTF8);
  a:='ä';
  b:='='#$C3#$A4; // #$C3#$A4 is UTF-8 for ä
  c:='ä='#$C3#$A4;
  writeln(a,b); // writes ä=ä
  writeln(c); // does not write ä=ä: the second ä comes out garbled
end.

You can see that a UTF-8 string constant works, and a string constant with
UTF-8 codes works too, but the combination does not work.
The above was compiled with -Fcutf8 and uses cwstring to set the
DefaultSystemCodePage to CP_UTF8.

So what went wrong?
The compiler treats any non-ASCII string constant (here: the ä) as
widestring (not UTF-16). You cannot fool the compiler with 'ä='+#$C3#$A4.
You must define two separate string constants.

Using any character outside the UCS-2 range results in
Fatal: illegal character "'�'" ($F0)
You can specify such characters with UTF-16 codes: #$D834#$DD1E

Yes, you read that right. Specifying the codepage with -Fcutf8 or
{$codepage utf8} actually defines a mix of UTF-8 and UTF-16.

Now compile the above without -Fcutf8:

65001 65001
ä=ä
ä=ä

Wow, everything looks as expected. You can even mix ASCII and non-ASCII
string constants.
Without the codepage the compiler stores string constants as byte sequences.
That's what UTF-8 is. That's why the LCL applications do not use the
codepage flags.
msegui has implemented an ecosystem of widestrings, so it works better with
a codepage.

Mattias
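
A side note on the UTF-16 codes #$D834#$DD1E mentioned in the message above. As a minimal worked example, assuming the standard UTF-16 surrogate-pair decoding, the pair maps to a single code point as follows. H, L and U are labels chosen here for the high surrogate, the low surrogate and the resulting code point; they are not names used by the compiler.

```latex
\begin{aligned}
H &= \mathtt{0xD834}, \qquad L = \mathtt{0xDD1E} \\
U &= \mathtt{0x10000} + (H - \mathtt{0xD800}) \cdot \mathtt{0x400} + (L - \mathtt{0xDC00}) \\
  &= \mathtt{0x10000} + \mathtt{0x34} \cdot \mathtt{0x400} + \mathtt{0x11E} \\
  &= \mathtt{0x10000} + \mathtt{0xD000} + \mathtt{0x11E} \\
  &= \mathtt{0x1D11E}
\end{aligned}
```

So the pair stands for U+1D11E, which lies outside the UCS-2 range. Its UTF-8 encoding is the four bytes F0 9D 84 9E, and the leading $F0 would be consistent with the byte reported in the "illegal character" error, though the message does not say which character triggered it.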