Re: [fpc-devel] String and UnicodeString and UTF8String
Sven Barth wrote:

>>> Widestring will also grind the application to a halt due to being COM based on Windows.
>> How that?
> WideString on Windows has no reference counting, thus every time a WideString is assigned it needs to be copied.

I'm not so sure of that. AFAIR the field exists, but it's unused or reserved for shared memory management.

Of course the requirement that a BSTR has to reside in shared memory discourages the use of exactly that type for string handling inside an application. I only wanted to prevent the introduction of another UTF16String type, in addition to WideString, BSTR (WinAPI) and UnicodeString (Delphi). Conversion-wise, WideString/BSTR and (other) UTF-16 strings are equivalent.

> Nearly all Windows API functions only allow single-byte encodings or UTF-16. The only functions that I'm aware of that can use UTF-8 encoding are the console input/output API functions (if the codepage is set to UTF-8) [and also file I/O APIs, but they don't assume any encoding].

And the conversion functions of course (MBCStoWStr...).

DoDi

___
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
Re: [fpc-devel] String and UnicodeString and UTF8String
On 14-1-2011 13:21, Hans-Peter Diettrich wrote:

> Sven Barth wrote:
>>>> Widestring will also grind the application to a halt due to being COM based on Windows.
>>> How that?
>> WideString on Windows has no reference counting, thus every time a WideString is assigned it needs to be copied.
> I'm not so sure of that. AFAIR the field exists, but it's unused or reserved for shared memory management.

Yes, if you use the set of memory allocators I mentioned, the field *will* be used. COM marshalling. No COM, no count. Simple as that.

It is unused because the memory manager doesn't use it. COM is not implemented unless you use a COM-based memory manager. No COM, no reference count.
Re: [fpc-devel] String and UnicodeString and UTF8String
On Wednesday, 12. January 2011 23.05:02 Juha Manninen wrote:

> Martin Schreiber wrote on Monday, 10 January 2011 19:22:49:
>> On Monday, 10. January 2011 16.27:19 Marco van de Voort wrote:
>>> And there are three such cases
>>> - normal FPC and Delphi 2007- code: ansistring(0)
>>> - Lazarus: ansistring=utf8
>>> - Delphi 2009+: UTF16.
>> - fpGUI: ansistring = utf-8
>> - MSEgui: existing FPC UnicodeString = utf-16
> Without studying your code myself I guess you had to make many utility functions and classes yourself for UTF-16? Even the normal TStringList doesn't work.

Correct. MSEgui has a complete development environment for UnicodeString with its own set of lists, streams, file and directory functions and the like.

Martin
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Hans-Peter Diettrich said:

>>>> non-native strings, it can also be a performance win).
>>> IMO a single encoding, i.e. UTF-8, can cover all cases.
>> Well, for starters, it doesn't cover the existing Delphi/unicode codebase.
> Because it's bound to UTF-16? That's not a problem, because WideString will continue to exist, and according conversions are still inserted by the compiler.

That is DIY compatibility, or, in other words, no compatibility. Widestring will also grind the application to a halt due to being COM based on Windows.

>>> While some hard core Ansi coders may whine about such a convention, the absence of implicit string conversions (except in external library calls) will make such applications more performant than mixed-encoding versions.
>> I don't see why this is the case. A current system encoding application does not do any conversion. (except for GUI output, and that can be considered negligible compared to the actual GUI overhead)
> When system encoding changes with the target platform, indexed access to such strings can lead to different results. Unless the compiler can read the coder's mind...

You don't have to. The Delphi model provides a stringtype for the system encoding, and then as such all strings from the system can be labeled. With other stringtypes, the necessary conversions can be edited.

Likewise, e.g. win32 console routines can be labeled with OEMString. (Since Windows uses a different default encoding for the console)

>>> Why spend time in the design of multiple RTL/LCL versions, when a single version will be perfectly sufficient?
>> Why spend 13 years being compatible when you can throw it away in a second?
> It's sufficient to throw away what's no more needed :-)

The previous message from Jeff shows that even shortstring is still in major production use. Nothing is unused, and nothing can be clipped without a long-winded transition or painful breaks like Delphi 2009's.

Moreover, these discussions are useless since you know as well as I do that no one stringtype will ever satisfy everybody. So IMHO it is time to draw the consequences from the 500 posts on the Unicode subject on this and other FPC/Lazarus lists and start thinking about solutions to manage that, instead of reiterating the "one type to rule them all" mantra ad infinitum.
Re: [fpc-devel] String and UnicodeString and UTF8String
Marco van de Voort wrote:

> In our previous episode, Hans-Peter Diettrich said:
>>>>> non-native strings, it can also be a performance win).
>>>> IMO a single encoding, i.e. UTF-8, can cover all cases.
>>> Well, for starters, it doesn't cover the existing Delphi/unicode codebase.
>> Because it's bound to UTF-16? That's not a problem, because WideString will continue to exist, and according conversions are still inserted by the compiler.
> That is DIY compatibility, or, in other words, no compatibility.

I still don't understand the problem :-(

> Widestring will also grind the application to a halt due to being COM based on Windows.

How that?

>> When system encoding changes with the target platform, indexed access to such strings can lead to different results. Unless the compiler can read the coder's mind...
> You don't have to. The Delphi model provides a stringtype for the system encoding, and then as such all strings from the system can be labeled. With other stringtypes, the necessary conversions can be edited.

Indexed string access produces other results for Ansi and UTF-8 system encoding. Such code is not portable, and neither is the data (ini files). Allowing for UTF-8 as the system encoding will frustrate Windows users (dunno whether Windows allows for such a system encoding), and Linux users are frustrated when UTF-8 is disallowed.

Only solution: using OS encoding restricts the code to run on a single machine only, or on similarly configured machines. The group of users who accept this restriction will be happy with a single AnsiString type and no implicit conversions. Without implicit conversions such a string type can hold UTF-8 as well.

> Likewise, e.g. win32 console routines can be labeled with OEMString. (Since Windows uses a different default encoding for the console)

This either implies OEM encoding as the system encoding of Win32 console applications, or the use of multiple codepages, as before. But IMO the Win32 console also implements a W interface, so that it's up to the user to use whatever is more appropriate for his code. The RTL has to distinguish between system-wide filesystem and GUI encoding in file handling (CreateFile...).

>>>> Why spend time in the design of multiple RTL/LCL versions, when a single version will be perfectly sufficient?
>>> Why spend 13 years being compatible when you can throw it away in a second?
>> It's sufficient to throw away what's no more needed :-)
> The previous message from Jeff shows that even shortstring is still in major production use. Nothing is unused, and nothing can be clipped without a long-winded transition or painful breaks like Delphi 2009's.

It's all about the well known dilemma:
- force (possibly many) implicit conversions, or
- supply multiple RTL/LCL versions, or
- break legacy user code by moving to a different (but again unique) string type.

> Moreover, these discussions are useless since you know as well as I do that no one stringtype will ever satisfy everybody. So IMHO it is time to draw the consequences from the 500 posts on the Unicode subject on this and other FPC/Lazarus lists and start thinking about solutions to manage that, instead of reiterating the "one type to rule them all" mantra ad infinitum.

The discussion is only about the pros and cons of the various possible solutions. I.e. it should reveal the critical cases and consequences that have to be considered and handled in every implementation.

The implementation can choose any model. Different models can be implemented as well, so that the final decision about the new standard can be delayed until the models can be tested in real world applications.

One model has already been implemented: UTF-8. It may need some additions/improvements, like a *hard* separation of AnsiString from UTF8String, and nothing has to be thrown away.

DoDi
Re: [fpc-devel] String and UnicodeString and UTF8String
On 12.01.2011 22:40, Marco van de Voort wrote:

> In our previous episode, Sven Barth said:
>>> legacy code can be broken by (eventually) required changes to set of char, sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.
>> I don't believe that PChar will be touched, because too much code that interfaces with C code depends on that. Although its declaration might not be the same then and become PChar = PAnsiChar instead of PChar = ^Char if Char is changed (currently it's PAnsiChar = PChar).
> Current Delphi _does_ regard char as equivalent lowlevel type to string. So whatever you choose as string (8 or 16-bit), pchar will match it by changing to pansichar or pwidechar

Oh come on -.- There are some days on which I really dislike the developers of Delphi...

Regards,
Sven
Re: [fpc-devel] String and UnicodeString and UTF8String
On 13.01.2011 18:57, Hans-Peter Diettrich wrote:

>> Widestring will also grind the application to a halt due to being COM based on Windows.
> How that?

WideString on Windows has no reference counting, thus every time a WideString is assigned it needs to be copied.

>>> When system encoding changes with the target platform, indexed access to such strings can lead to different results. Unless the compiler can read the coder's mind...
>> You don't have to. The Delphi model provides a stringtype for the system encoding, and then as such all strings from the system can be labeled. With other stringtypes, the necessary conversions can be edited.
> Indexed string access produces other results for Ansi and UTF-8 system encoding. Such code is not portable, and neither is the data (ini files). Allowing for UTF-8 as the system encoding will frustrate Windows users (dunno whether Windows allows for such a system encoding), and Linux users are frustrated when UTF-8 is disallowed.

Nearly all Windows API functions only allow single-byte encodings or UTF-16. The only functions that I'm aware of that can use UTF-8 encoding are the console input/output API functions (if the codepage is set to UTF-8) [and also file I/O APIs, but they don't assume any encoding].

Regards,
Sven
Re: [fpc-devel] String and UnicodeString and UTF8String
On 13-1-2011 21:40, Sven Barth wrote:

> WideString on Windows has no reference counting, thus every time a WideString is assigned it needs to be copied.

Not exactly true. WideString is COM marshaled and thus has reference counting on the COM level, AFAIK. As long as your memory manager is COM marshaled too, that is. And since most Pascal memory manager versions do not support COM directly, it goes wrong in a big way.

I once wrote a simple COM memory manager to test this. Performance stays sh*t, but strings seem to be counted, not copied. If you use CoTaskMemAlloc, CoTaskMemFree and CoTaskMemRealloc in your memory manager you will see what I mean. At least it comes close, but slow it will stay.
Re: [fpc-devel] String and UnicodeString and UTF8String
On Thursday, 13. January 2011 18.57:00 Hans-Peter Diettrich wrote:

> The implementation can choose any model. Different models can be implemented as well, so that the final decision about the new standard can be delayed until the models can be tested in real world applications.
>
> One model has already been implemented: UTF-8. It may need some additions/improvements, like a *hard* separation of AnsiString from UTF8String, and nothing has to be thrown away.

Another already implemented model is UTF-16 UnicodeString in MSEgui. It needs no changes in the Free Pascal compiler.

Martin
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/11/2011 05:50 PM, Hans-Peter Diettrich wrote:

> Since the generic Delphi string type can be any Unicode encoding now,

This is what I (not owning a Delphi 2007) thought, too, and have been bashed for: from what I read, I understood that the dynamically coded string type can hold 1-, 2-, and 4-byte (maybe even more) codes for its elements (denoted in one control value), and each of those (theoretically) in different coding schemes (denoted in another control value), allowing e.g. for UTF-8, UTF-16, UCS4, German ANSI, raw Byte, string.

But the document "Delphi and Unicode" by Marco Cantu ( http://edn.embarcadero.com/article/images/38980/Delphi_and_Unicode.pdf ), dated Nov, 2008, in fact states:

> length, the second element is the reference count. In Delphi 2009 the representation for reference-counted strings becomes:
>
>     Offset                      Field
>     -12                         Code page
>     -10                         Elem size
>     -8                          Ref count
>     -4                          length
>     String reference address    First char of string
>
> Beside the length and reference count, the new fields represent the element size and the code page. While the element size is used to discriminate between AnsiString and UnicodeString, the code page makes sense in particular for the AnsiString type (as it works in Delphi 2009), as the UnicodeString type has the fixed code page 1200. A corresponding support data structure is declared in the implementation section of System unit as:
>
>     type
>       PStrRec = ^StrRec;
>       StrRec = packed record
>         codePage: Word;
>         elemSize: Word;
>         refCnt: Longint;
>         length: Longint;
>       end;

But maybe the document is outdated.

-Michael
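[Editor's sketch] The header layout quoted from the white paper can be checked with a few lines of Python. This is only an illustration of the quoted `StrRec` record, assuming little-endian storage, a 2-byte `Word` and a 4-byte `Longint`; the helper names (`make_header`, `describe`) are made up for this example and are not part of Delphi or FPC.

```python
import struct

# Assumed mapping of the quoted record: codePage: Word, elemSize: Word,
# refCnt: Longint, length: Longint, packed, little-endian.
STR_REC = struct.Struct("<HHii")

# Field offsets relative to the string reference address (first char).
FIELD_OFFSETS = {
    "code_page": 0 - STR_REC.size,
    "elem_size": 2 - STR_REC.size,
    "ref_count": 4 - STR_REC.size,
    "length": 8 - STR_REC.size,
}


def make_header(code_page, elem_size, ref_count, length):
    """Pack the four header fields into the 12 bytes preceding the data."""
    return STR_REC.pack(code_page, elem_size, ref_count, length)


def describe(header):
    """Unpack a 12-byte header into a dict of named fields."""
    code_page, elem_size, ref_count, length = STR_REC.unpack(header)
    return {
        "code_page": code_page,
        "elem_size": elem_size,
        "ref_count": ref_count,
        "length": length,
    }


# A UnicodeString-like header: code page 1200 (UTF-16), 2-byte elements.
unicode_header = make_header(1200, 2, 1, 5)
print(STR_REC.size, FIELD_OFFSETS, describe(unicode_header))
```

The computed offsets (-12, -10, -8, -4) agree with the table in the quoted document, which is all this sketch is meant to show.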
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/11/2011 05:19 PM, Hans-Peter Diettrich wrote:

> IMO a single encoding, i.e. UTF-8, can cover all cases.

Of course you are right here, but there are some things to be considered:

In Windows (and maybe elsewhere, too) a two-byte API (e.g. UTF-16) needs to be used, forcing lots of conversions when doing GUI applications.

_All_ beginners will use s[i] and expect to get a character without any afterthought. They will be very disappointed, when not using English, if they get bytes instead of characters. The count of the frustrated will be much smaller (but not zero) when using WideString/WideChar, where they get words instead of characters.

Eliminating the s[i] syntax would trash a lot of legacy code, and the decent replacement (finding the correct character and moving it into a DWord in UCS4) is slow and still does not handle all the funny Unicode character-combining stuff. But the count of frustrated beginners might be further reduced.

-Michael
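[Editor's sketch] What `s[i]` returns under each encoding can be made concrete with a small Python example. Python is used here only because it makes the encoded forms easy to inspect; it says nothing about how FPC or Delphi implement indexing. The sample strings and the helper `utf16_units` are chosen for illustration.

```python
def utf16_units(text):
    """Return the list of 16-bit code units of text in UTF-16."""
    data = text.encode("utf-16-le")
    return [int.from_bytes(data[i:i + 2], "little") for i in range(0, len(data), 2)]


word = "Grüße"                      # five characters, two of them non-ASCII
utf8 = word.encode("utf-8")

# Indexing the UTF-8 bytes: position 2 is only the first half of "ü".
third_byte = utf8[2]                # 0xC3, a lead byte, not a character

# Indexing UTF-16 code units works for this word ...
third_unit = utf16_units(word)[2]   # 0xFC, the code point of "ü"

# ... but not for characters outside the BMP, which need a surrogate pair.
clef_units = utf16_units("\U0001D11E")   # MUSICAL SYMBOL G CLEF

# Even full code points are not always "characters": e + combining acute.
combined = "e\u0301"

print(len(utf8), hex(third_byte), hex(third_unit),
      [hex(u) for u in clef_units], len(combined))
```

The last case is the "character-combining stuff" mentioned above: two code points that a reader sees as one character, in any of the three encodings.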
Re: [fpc-devel] String and UnicodeString and UTF8String
Jeff Wormsley wrote:

> On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
>> UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.
> Preface: I don't write international apps, and probably won't for the foreseeable future...

Then you may be bound to some legacy compiler version when the string handling changes at some future time, as happened to Delphi users. Continued support of AnsiString type(s) is not enough, because legacy code can be broken by (eventually) required changes to set of char, sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.

> Isn't all of this concentration on trying to make strings have single byte characters (who cares how they are encoded), using the argument that it is somehow faster, incorrect for just about any modern processor, including embedded CPU's such as ARM? It was my understanding that 32 bit aligned access was always faster than byte aligned access on just about any CPU FPC still supports.

See Marco's comment about data size etc.

> The argument holds just fine for memory, but I don't really get the speed argument. Maybe I'm missing something.

FPC (the compiler) still uses ShortStrings wherever possible, because that was found to be the most efficient string representation. This is partially due to the ASCII encoding of source code, except for string literals. But like you, I'm not sure that this argument still holds on modern hardware.

Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)

[1] All encodings of variable character size discourage indexed access to strings.

Then char can have multiple meanings, as either representing the (physical) string/array *element* size, or the (logical) size of a *codepoint*. Until now most users, including you, most probably don't realize that difference between physical and logical characters, and assume that sizeof(char) always is 1, and eventually that sizeof(WideChar) is 2.

IMO variables of type char should have at least 3 (better 4) bytes in a Unicode environment, in order to maintain the correspondence between physical and logical characters.

As already suggested, the packed keyword could be applied to strings and char arrays, to definitely signal to the user that indexed access should not be used with such variables, unless a speed penalty is acceptable.

DoDi
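[Editor's sketch] The speed penalty for indexed access to a variable-width encoding comes from having to scan from the start of the string. A minimal Python illustration follows, assuming well-formed UTF-8 input; the function name `utf8_codepoint_at` is hypothetical and not an RTL routine.

```python
def utf8_codepoint_at(data, index):
    """Return the index-th code point (0-based) of well-formed UTF-8 bytes.

    The cost is linear in the byte position of the result, because the
    physical position of a logical character is unknown until every
    preceding lead byte has been counted.
    """
    if index < 0:
        raise IndexError("negative code point index")
    seen = -1
    for pos, byte in enumerate(data):
        if byte & 0xC0 != 0x80:          # not a continuation byte: a new code point starts here
            seen += 1
            if seen == index:
                end = pos + 1
                while end < len(data) and data[end] & 0xC0 == 0x80:
                    end += 1             # swallow the continuation bytes
                return data[pos:end].decode("utf-8")
    raise IndexError("code point index out of range")


sample = "a\u00e9\u20ac\U0001D11E".encode("utf-8")   # 1-, 2-, 3- and 4-byte sequences
print(len(sample), [utf8_codepoint_at(sample, i) for i in range(4)])
```

Here ten physical bytes hold four logical code points, which is the physical/logical distinction drawn above.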
Re: [fpc-devel] String and UnicodeString and UTF8String
On 12.01.2011 13:38, Hans-Peter Diettrich wrote:

> Jeff Wormsley wrote:
>> On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
>>> UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.
>> Preface: I don't write international apps, and probably won't for the foreseeable future...
> Then you may be bound to some legacy compiler version when the string handling changes at some future time, as happened to Delphi users. Continued support of AnsiString type(s) is not enough, because legacy code can be broken by (eventually) required changes to set of char, sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.

I don't believe that PChar will be touched, because too much code that interfaces with C code depends on that. Although its declaration might not be the same then and become PChar = PAnsiChar instead of PChar = ^Char if Char is changed (currently it's PAnsiChar = PChar).

Regards,
Sven
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Hans-Peter Diettrich said:

>> memory management and the occasional code page conversion (and since this may reduce the number of code page conversions when working with non-native strings, it can also be a performance win).
> IMO a single encoding, i.e. UTF-8, can cover all cases.

Well, for starters, it doesn't cover the existing Delphi/unicode codebase.

> While some hard core Ansi coders may whine about such a convention, the absence of implicit string conversions (except in external library calls) will make such applications more performant than mixed-encoding versions.

I don't see why this is the case. A current system encoding application does not do any conversion. (except for GUI output, and that can be considered negligible compared to the actual GUI overhead)

> Why spend time in the design of multiple RTL/LCL versions, when a single version will be perfectly sufficient?

Why spend 13 years being compatible when you can throw it away in a second?
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Sven Barth said:

>> legacy code can be broken by (eventually) required changes to set of char, sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.
> I don't believe that PChar will be touched, because too much code that interfaces with C code depends on that. Although its declaration might not be the same then and become PChar = PAnsiChar instead of PChar = ^Char if Char is changed (currently it's PAnsiChar = PChar).

Current Delphi _does_ regard char as the equivalent low-level type to string. So whatever you choose as string (8 or 16-bit), pchar will match it by changing to pansichar or pwidechar.
Re: [fpc-devel] String and UnicodeString and UTF8String
On Wed, Jan 12, 2011 at 7:38 AM, Hans-Peter Diettrich drdiettri...@aol.com wrote:

> Until now most users, including you, most probably don't realize that difference between physical and logical characters, and assume that sizeof(char) always is 1

Oh, I'm aware of it. But to date, I haven't had to really deal with it in Delphi or FPC. My use of strings is either ancient legacy (from TP/BP days), where I simply changed all references to string to shortstring, or low-level Windows API code, where I'm dealing with PChar.

I find these discussions fascinating, but as they say in the southern US, I don't have a dog in this hunt. Whatever the decision, I'll probably continue to use shortstring.

Jeff.
Re: [fpc-devel] String and UnicodeString and UTF8String
Marco van de Voort wrote:

> In our previous episode, Hans-Peter Diettrich said:
>>> memory management and the occasional code page conversion (and since this may reduce the number of code page conversions when working with non-native strings, it can also be a performance win).
>> IMO a single encoding, i.e. UTF-8, can cover all cases.
> Well, for starters, it doesn't cover the existing Delphi/unicode codebase.

Because it's bound to UTF-16? That's not a problem, because WideString will continue to exist, and according conversions are still inserted by the compiler.

>> While some hard core Ansi coders may whine about such a convention, the absence of implicit string conversions (except in external library calls) will make such applications more performant than mixed-encoding versions.
> I don't see why this is the case. A current system encoding application does not do any conversion. (except for GUI output, and that can be considered negligible compared to the actual GUI overhead)

When system encoding changes with the target platform, indexed access to such strings can lead to different results. Unless the compiler can read the coder's mind...

>> Why spend time in the design of multiple RTL/LCL versions, when a single version will be perfectly sufficient?
> Why spend 13 years being compatible when you can throw it away in a second?

It's sufficient to throw away what's no more needed :-)

DoDi
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/10/2011 04:27 PM, Marco van de Voort wrote:

> And what do we do if e.g. Lazarus changes opinion and goes from utf8 to utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).

The current way Lazarus works (UTF-8 in a string type called AnsiString, on Windows as well as on Linux, without any auto-conversion, introducing funny problems e.g. when just assigning a string constant to a WideString) does not seem very appropriate. I feel the logical move would be to use the dynamically encoded string type in the LCL API, but there might be some nasty hidden problems (e.g. with var parameters).

-Michael
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/10/2011 04:27 PM, Marco van de Voort wrote:

> I think in the planned Embarcadero cross-compile products, string will also be utf-16 on OS X and Linux.

Yak, I had hoped that using the dynamically encoded string type nearly everywhere would allow for a great lot of non-OS-specific code in the VCL (and LCL), without the need for excessive conversions, while maintaining the system's coding (UTF-16 or UTF-8) in and out with GUI-centric user code. I thought this would have been the main reason for introducing the additional complexity of the dynamically encoded string type.

-Michael
Re: [fpc-devel] String and UnicodeString and UTF8String
> I think at most two are required for any target: unicodestring (D2009 compatibility), and if really necessary because somehow the unicodestring version causes too much overhead, an ansistring($) version as well. That's only for the classes though, I think most of the base RTL can be simply ansistring($).

So if I understand correctly, the UnicodeString and also AnsiString types must be extended so that they also hold information about the actual codepage (encoding) of the string data they hold. (AFAIK ATM they hold only information about reference count and size, and of course the data.)

I am not an expert, so I do not understand all the aspects/problems connected with proper string handling, but some kind of implicit conversion (based on the actual encoding of the string data) is necessary (ANSI - UTF-8 - UTF-16 - ANSI ... etc.).

For example, there is the known problem with the Euro currency symbol. On Windows it is stored in the CurrencyString global variable using the ANSI codepage, but it is used in the LCL (which expects UTF-8 encoding) without any explicit conversion, which leads to displaying ? instead of € (for example in TDBEdit or TDBGrid).

Another problem occurs when displaying character data in data-aware database controls (TDBEdit, TDBGrid). Data-aware controls (LCL) read data from TField descendants (FCL) using the TField.Text property, which returns string (without codepage information it is not clear whether it is AnsiString, UTF8String or UnicodeString). The LCL expects UTF-8 strings, but that is not true in all cases (for example in the case of ODBC).

-Laco.
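[Editor's sketch] The Euro symbol problem described above can be reproduced outside of the LCL. The following Python lines assume the Windows ANSI codepage is 1252 (Western European); other ANSI codepages place the Euro sign elsewhere or lack it. Nothing here calls LCL or FCL code.

```python
# The Euro sign as it would be stored under the assumed ANSI codepage 1252.
ansi_bytes = "\u20ac".encode("cp1252")                 # a single byte, 0x80

# A consumer that expects UTF-8 and gets these bytes unconverted cannot
# decode them: 0x80 is a continuation byte with no lead byte before it.
shown = ansi_bytes.decode("utf-8", errors="replace")   # replacement character

# With an explicit conversion step the same data displays correctly.
utf8_bytes = ansi_bytes.decode("cp1252").encode("utf-8")

print(ansi_bytes, repr(shown), utf8_bytes)
```

The replacement character (often drawn as "?") is what the user sees in the first case; the three bytes E2 82 AC are what a UTF-8 consumer needs.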
Re: [fpc-devel] String and UnicodeString and UTF8String
Marco van de Voort wrote:

> Btw, while looking up rawbytestring I saw this in the Delphi help:
>
> Declaring variables or fields of type RawByteString should rarely, if ever, be done, because this practice can lead to undefined behavior and potential data loss.

IIRC RawByteString should be used like OpenString, as subroutine argument type only. In contrast to the name, a RawByteString has a variable encoding, i.e. implicit conversions are inserted for every use with other string types. Thus AnyByteString would have been a better name for that type, IMO. Delphi no longer (officially) supports non-textual data in strings, and TBytes should be used for such data.

> How will you deal with e.g. Windows? Legacy string=ansistring(0), D2009 is string=utf16 TUnicodestring?

Is a Delphi UnicodeString really compatible with a WinAPI WideString/BSTR? AFAIR all BSTRs must reside in shared memory, so that copies are required for every API call.

> Mainly the question what the classtree will be. The main operating type used in applications. You always need two RTLs for that, since it can be 1 or 2 byte, and even if you fixated it on one byte encodings, rawbytestring would force you to write case statements in each and every procedure.

UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.

DoDi
Re: [fpc-devel] String and UnicodeString and UTF8String
Jonas Maebe wrote:

> This has the advantage that you always have all optimal implementations available, regardless of the platform or default string encoding. It does not require extra work because we have to write all those versions also if we want the RTL to be compilable for different default string encodings. And three checks in a case statement are not going to define the performance in a context of atomic reference counting, dynamic memory management and the occasional code page conversion (and since this may reduce the number of code page conversions when working with non-native strings, it can also be a performance win).

IMO a single encoding, i.e. UTF-8, can cover all cases.

While some hard core Ansi coders may whine about such a convention, the absence of implicit string conversions (except in external library calls) will make such applications more performant than mixed-encoding versions. The argument "my characters *always* will be inside my preferred codepage" will prove false sooner or later.

While it's not up to a programming language to teach people the better way of coding, the required efforts of the FPC/Lazarus developers IMO should have more weight. Why spend time in the design of multiple RTL/LCL versions, when a single version will be perfectly sufficient?

DoDi
Re: [fpc-devel] String and UnicodeString and UTF8String
Jonas Maebe wrote:

> And we have to deal with Windows, where the default is UTF16.
> ... since Delphi 2009 uses (unicode)string everywhere, we need at least also unicode versions.

Since the generic Delphi string type can be any Unicode encoding now, it IMO would be legal to use UTF-8 or UTF-32 for it internally, in FPC. Some code, expecting UCS2/BMP text only, may become a bit slower due to according conversions in indexed access to chars, but no other *implicit* conversions will ever occur.

Likewise the generic char type could become a 32 bit type, so that it can hold *every* Unicode codepoint. For both string and array of char the packed keyword could be used to distinguish between different bytecount and encoding, where unpacked types contain UTF-32 chars. This would speed up user code with indexed access, in contrast to both UTF-8 and -16 encodings, and it would allow the user to optimize his code for either speed or size. Indexed access to packed types could simply be disallowed, without breaking anything since the default is not packed.

Just some more ideas...

DoDi
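[Editor's sketch] The speed/size trade-off between a "packed" (UTF-8) and an "unpacked" (UTF-32) representation can be shown in a few lines of Python. The terms packed/unpacked follow the proposal above and are not existing FPC semantics; `utf32_char_at` is a hypothetical helper.

```python
def utf32_char_at(data, index):
    """Constant-time indexed access: every code point occupies 4 bytes."""
    start = 4 * index
    if index < 0 or start + 4 > len(data):
        raise IndexError("code point index out of range")
    return chr(int.from_bytes(data[start:start + 4], "little"))


text = "a\u20ac\U0001D11E"               # 1-, 3- and 4-byte sequences in UTF-8

packed = text.encode("utf-8")             # compact, but no direct indexing
unpacked = text.encode("utf-32-le")       # larger, index i lives at byte 4*i

print(len(packed), len(unpacked), utf32_char_at(unpacked, 2))
```

For this sample the unpacked form costs 12 bytes against 8, in exchange for indexed access that needs no scan; for plain ASCII text the ratio would be 4 to 1.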
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:

> UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.

Preface: I don't write international apps, and probably won't for the foreseeable future...

Isn't all of this concentration on trying to make strings have single-byte characters (who cares how they are encoded), using the argument that it is somehow faster, incorrect for just about any modern processor, including embedded CPUs such as ARM? It was my understanding that 32-bit aligned access was always faster than byte-aligned access on just about any CPU FPC still supports.

The argument holds just fine for memory, but I don't really get the speed argument. Maybe I'm missing something.

Jeff.
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Jeff Wormsley said:

>> encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.
> Preface: I don't write international apps, and probably won't for the foreseeable future...
> Isn't all of this concentration on trying to make strings have single byte characters (who cares how they are encoded), using the argument that it is somehow faster, incorrect for just about any modern processor, including embedded CPU's such as ARM? It was my understanding that 32 bit aligned access was always faster than byte aligned access on just about any CPU FPC still supports.

1-byte access is always 1-byte aligned, and the memory system is still slower than these kinds of issues. And you shuffle a lot of extra zeroes around.

But the trouble is also that the 2-byte situation doesn't really solve anything (you still have surrogates, and it will never be as simple as it was), and there is a much bigger problem with legacy data (how much two-byte data do you get daily, and how much 1-byte?)

> The argument holds just fine for memory, but I don't really get the speed argument. Maybe I'm missing something.

Shoveling twice as much memory around IS the speed argument :-)
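To put numbers on "twice as much memory" and on the surrogate remark, here is a minimal sketch in Python (used only for illustration; the helper name and the sample strings are made up). It compares the encoded size of ASCII-only text in three encodings and counts the UTF-16 code units needed for one character outside the BMP.

```python
def encoded_sizes(text: str) -> dict:
    """Byte counts of the same text in three Unicode encodings (no BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


ascii_text = "program hello; begin writeln('hi') end."
sizes = encoded_sizes(ascii_text)
print(sizes)

# For pure ASCII, UTF-16 moves twice and UTF-32 four times as many bytes as UTF-8.
assert sizes["utf-16"] == 2 * sizes["utf-8"]
assert sizes["utf-32"] == 4 * sizes["utf-8"]

# A character outside the BMP still needs two 16-bit code units (a surrogate pair),
# so UTF-16 is not a fixed-width encoding either.
clef = "\U0001D11E"                      # MUSICAL SYMBOL G CLEF
utf16_units = len(clef.encode("utf-16-le")) // 2
print(utf16_units)
```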
Re: [fpc-devel] String and UnicodeString and UTF8String
On 01/10/2011 09:12 AM, LacaK wrote:

> In current Delphi is String synonym for base type UnicodeString UTF-16

AFAIK, in current Delphi (which I don't have) a String is a variable that can contain dynamically coded information (such as locally coded 8-bit ANSI, UTF-8, UTF-16, ...) and, of course, knows which encoding it holds.

If a string is generated by the VCL from a Windows API function, the encoding will be UTF-16, but if you create a string with some other encoding it will be automatically re-coded to UTF-16 before it is sent to a Windows API function.

-Michael
Re: [fpc-devel] String and UnicodeString and UTF8String
> AFAIK, in current Delphi (which I don't have) a String is a variable that can contain dynamically coded information (such as locally coded 8-bit ANSI, UTF-8, UTF-16, ...) and - of course - know which code it holds.

I read the statement "By default, variables declared as type String are UnicodeString" as meaning that String=UnicodeString. See: http://docwiki.embarcadero.com/VCL/en/System.UnicodeString and also http://docwiki.embarcadero.com/RADStudio/en/String_Types#UnicodeString

Note also that AnsiString holds additional information about character encoding: "The AnsiString (http://docwiki.embarcadero.com/VCL/en/System.AnsiString) structure contains a 32-bit length indicator, a 32-bit reference count, a 16-bit data length indicating the number of bytes per character, and a 16-bit code page."

-Laco.

> If a string is generated by the VCL from a Window API function, the coding will be UTF-16, though, but if you create a string with some other coding it will be automatically re-coded to UTF16 before sending it into a Windows API function.
> -Michael
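The four header fields quoted above add up to 12 bytes. A minimal sketch of such a header, in Python for illustration only: the field order, the little-endian layout and the signed 32-bit integers are assumptions, since the quoted documentation only lists the fields and their widths.

```python
import struct

# Hypothetical layout, for illustration only; only the field widths come from the quoted text:
# 16-bit code page, 16-bit bytes per character, 32-bit reference count, 32-bit length.
HEADER = struct.Struct("<HHii")


def make_header(code_page: int, elem_size: int, ref_count: int, length: int) -> bytes:
    """Pack the four header fields into their binary form."""
    return HEADER.pack(code_page, elem_size, ref_count, length)


hdr = make_header(65001, 1, 1, 5)        # 65001 is the Windows code page number for UTF-8
code_page, elem_size, ref_count, length = HEADER.unpack(hdr)
print(HEADER.size, code_page, elem_size, ref_count, length)   # 12 65001 1 1 5
```

Because the code page travels with the string data, a routine receiving such a string can tell how its bytes are encoded without guessing.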
Re: [fpc-devel] String and UnicodeString and UTF8String
On 10 Jan 2011, at 09:12, LacaK wrote:

> In current Delphi is String synonym for base type UnicodeString UTF-16
> AFAIU ATM in FPC is String synonym for AnsiString (as in previous versions of Delphi)
> Are there any plans to change meaning of String type ? (like Delphi to UnicodeString , or UTF8String?)

If/when this is done, it will only be with a compiler switch or directive.

> Are there any plans to introduce implicit conversions between AnsiStrings (ANSI code page) to UTF8Strings (UTF-8 encoded) or something like this ?

That would be part of the general D2009 ansistring support you referred to in your other message. There is an svn branch (cpstrnew) that contains some preliminary work for this functionality, but nobody has worked on it for a long time. Developers interested in finishing that functionality are welcome!

Jonas
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Jonas Maebe said:

>> In current Delphi is String synonym for base type UnicodeString UTF-16
>> AFAIU ATM in FPC is String synonym for AnsiString (as in previous versions of Delphi)
>> Are there any plans to change meaning of String type ? (like Delphi to UnicodeString , or UTF8String?)
> If/when this is done, it will only be with a compiler switch or directive.

( That won't be enough, since that would not change the relevant units and classes to such a type (e.g. tstringlist would remain defined as ansistring).

For this to work, we will probably have to split targets into UTF16 and ansi (and maybe multiple ansi's for some platforms). )
Re: [fpc-devel] String and UnicodeString and UTF8String
On 10 Jan 2011, at 13:33, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
>>> In current Delphi is String synonym for base type UnicodeString UTF-16
>>> AFAIU ATM in FPC is String synonym for AnsiString (as in previous versions of Delphi)
>>> Are there any plans to change meaning of String type ? (like Delphi to UnicodeString , or UTF8String?)
>> If/when this is done, it will only be with a compiler switch or directive.
> ( That won't be enough, since that would not change the relevant units and classes to such type. (e.g. tstringlist would remain defined ansistring)

If it's a D2009-style ansistring, does that matter?

Jonas
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Jonas Maebe said:

>>> If/when this is done, it will only be with a compiler switch or directive.
>> ( That won't be enough, since that would not change the relevant units and classes to such type. (e.g. tstringlist would remain defined ansistring)
> If it's a D2009-style ansistring, does that matter?

A lot of conversion, since it will use ansistring(0), so reading/writing ansistring(cp_utf8) will force conversions. (0 means system encoding, $ffff means never convert.)

Besides that, the usual three problems:
- I don't know how VAR behaves in this case (passing an ansistring(cp_utf8) to a var ansistring(0) parameter),
- maybe overloading (only corner cases?) etc.
- inheritance. FPC defines base classes with ansistring(0) parameters, and Lazarus wants to inherit and override them with a different type. This will clash.

I've thought long and hard about this. Since the discussion about what the dominant type should be won't stop anytime soon, and we will probably have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as base types in the long run, plus ANSI for a time as legacy, the RTL has to be prepared for it anyway, so we might as well allow this on all platforms from the start. (Actually releasing them is a different question and depends on manpower.)

That doesn't mean that a per-unit switch is useless, but I think a target switch to fix the bulk of the cases will save both us and the users a lot of grief.
Re: [fpc-devel] String and UnicodeString and UTF8String
On 10 Jan 2011, at 13:57, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
>>>> If/when this is done, it will only be with a compiler switch or directive.
>>> ( That won't be enough, since that would not change the relevant units and classes to such type. (e.g. tstringlist would remain defined ansistring)
>> If it's a D2009-style ansistring, does that matter?
> A lot of conversion, since it will use ansistring(0) so reading/writing ansistring(cp_utf8) will force conversions. (0 means system encoding, $ffff means never convert)

Why should a tstringlist force ansistring(0)? Or does Delphi force it to be that way?

Conversion may indeed be required for output (input would only pass on the encoding of the input if based on ansistring($ffff)), but I think doing that only when necessary at the lowest level should be no problem. Many existing frameworks work that way.

> Besides that the usual three problems:
> - I don't know how VAR behaves in this case. (passing a ansistring(cp_utf8) to a var ansistring(0) parameter),

var-parameters may indeed pose a problem in case some parameters of OS-neutral routines are required to have a particular encoding specified.

> - maybe overloading (only cornercases?) etc.

Possibly, although I guess there are probably rules for that (whether they are documented is another matter, though...)

> - inheritance. FPC defines base classes as ansistring(0) parameters, and Lazarus wants to inherit and override them with a different type. This will clash.

Why ansistring(0) for base classes? OS-level interfaces: yes, but why base classes?

> I've thought long and hard about this. Since the discussion what the dominant type should be won't stop anytime soon, and we probably will have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as basetypes in the long run, plus a time ANSI as legacy, the RTL has to be prepared for it anyway, we might as well allow this on all platforms from the start. (actually releasing them is a different question and depends on manpower)

I agree that the RTL should work regardless of the string encoding used, but I don't see why a particular encoding should be enforced throughout the entire RTL rather than just using ansistring($ffff) almost everywhere.

I also agree that we should strive to minimize the number of conversions in the RTL for some encodings (in particular indeed ansi, utf-8 and utf-16), but again this should not require a specially compiled RTL. E.g., insert(ansistring($ffff)), delete(ansistring($ffff)), etc. can call special-purpose versions for certain specific encodings of the input (e.g., for the three you mentioned), and perform a round trip via some generic format (utf-16, utf-32, or something that depends on the platform) only if the encoding is not directly supported or if different encodings are mixed.

This has the advantage that you always have all optimal implementations available, regardless of the platform or default string encoding. It does not require extra work because we have to write all those versions also if we want the RTL to be compilable for different default string encodings. And three checks in a case statement are not going to define the performance in a context of atomic reference counting, dynamic memory management and the occasional code page conversion (and since this may reduce the number of code page conversions when working with non-native strings, it can also be a performance win).

Outside the RTL, the encoding mainly matters if you perform manual low-level processing of a string (for i:=1 to length(s) do something_with(s[i])). But in that case your code will either work with only one encoding, and you have to enforce it via the parameter type anyway, or it has to work with multiple encodings, and then you can use a technique similar to what I described above for the RTL.

> That doesn't mean that a per unit switch is useless, but I think a target switch to fixate the bulk of the cases will save both us and the users a lot of grief.

It's not really clear to me which problem this would solve, but I may be missing something.

Jonas
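The dispatch idea from this message (special-purpose paths for a few encodings, a round trip through a generic format otherwise) can be sketched as follows. This is a rough Python model, not RTL code: the Tagged type, the DIRECT set and the choice of UTF-8 as the result of the generic path are all assumptions, and concatenation stands in for insert/delete to keep the example short.

```python
from dataclasses import dataclass

# Encodings assumed to have a special-purpose implementation (the "three you mentioned").
DIRECT = ("cp1252", "utf-8", "utf-16-le")


@dataclass(frozen=True)
class Tagged:
    """A byte string that carries its own encoding, like a code-page-aware ansistring."""
    data: bytes
    encoding: str


def concat(a: Tagged, b: Tagged) -> Tagged:
    if a.encoding == b.encoding and a.encoding in DIRECT:
        # Fast path: same, directly supported encoding, so the bytes are joined unchanged.
        return Tagged(a.data + b.data, a.encoding)
    # Generic path: mixed or unsupported encodings take a round trip through a common
    # format (a Python str here) and come back as UTF-8 (an arbitrary choice).
    text = a.data.decode(a.encoding) + b.data.decode(b.encoding)
    return Tagged(text.encode("utf-8"), "utf-8")


same = concat(Tagged("Grü".encode("cp1252"), "cp1252"),
              Tagged("ße".encode("cp1252"), "cp1252"))
mixed = concat(Tagged("Grü".encode("cp1252"), "cp1252"),
               Tagged("ße €".encode("utf-8"), "utf-8"))
print(same.encoding, mixed.encoding)     # cp1252 utf-8
```

Only the mixed case pays for a conversion; operands that already share a supported encoding never leave it.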
Re: [fpc-devel] String and UnicodeString and UTF8String
In our previous episode, Jonas Maebe said:

>>> If it's a D2009-style ansistring, does that matter?
>> A lot of conversion, since it will use ansistring(0) so reading/writing ansistring(cp_utf8) will force conversions. (0 means system encoding, $ffff means never convert)
> Why should a tstringlist force ansistring(0)?

I mean that if you locally (for your units) set string=utf8string, TStringList would still be ansistring(0) or whatever the default becomes (and it could even be UTF16). Since TStringList inherits from TStrings, so would most Lazarus components.

> Or does Delphi force it to be that way?

In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16), ansistring (+ variants) are for legacy only, and people try to forget shortstring as quickly as possible. Backwards compatibility with pre-D2009 is essentially abandoned. I think they didn't even try, for exactly the reasons I mean to address here.

I think in the planned Embarcadero cross-compile products, string will also be utf-16 on OS X and Linux. If only because (1) it is easier, and Windows remains dominant by far (including codebases that assume UTF16), and (2) they plan to target QT.

Keep in mind that soon it will not be possible to upgrade from ansistring to a current version anymore (and something like D5..D7 already is no longer upgradable). Embarcadero changed the upgrade rules.

From Delphi-related forums and mailing lists, I get the impression that most full-time Delphi programmers migrated to unicode, and the occasional and legacy users did not. The gap between these two groups is widening, but contrary to Embarcadero, we will be dealing with significant portions of both groups for a while (as new/existing users).

So the question is how we are going to deal with this information without forcing a big bang like Embarcadero did, prepare to support both (or more? see below) schemes for a while, _AND_ deal with the fact that UTF16 is mostly alien on non-Windows.

For me, having a mandatory UTF16 Unix is not an option, and neither is a mandatory UTF8 Windows (D2009+ incompatible).

Since no one choice with one default type per target (or even one to rule them all) will satisfy everybody, I was thinking about setting up multiple targets.

Of course it is uncharted territory, and while I lean towards that solution, it could be that there are hidden caveats.

> Conversion may indeed be required for output (input would only pass on the encoding of the input if based on ansistring($ffff))

ansistring(0), system encoding, would be more logical than $ffff. $ffff is used more internally in string conversion routines and for strings that are not strings. But what does that mean on Windows, where the console encoding is OEMSTRING and not ansistring(0)?

> but I think doing that only when necessary at the lowest level should be no problem. Many existing frameworks work that way.

It touches all places where you touch the OS. But indeed one could try to split this by doing the classes utf8 or tunicodestring depending on OS. And we have to deal with Windows, where the default is UTF16.

>> Besides that the usual three problems:
>> - I don't know how VAR behaves in this case. (passing a ansistring(cp_utf8) to a var ansistring(0) parameter),
> var-parameters may indeed pose a problem in case some parameters of OS-neutral routines are required to have a particular encoding specified.
>> - maybe overloading (only cornercases?) etc.
> Possibly, although I guess there are probably rules for that (whether they are document is another case though, probably...)
>> - inheritance. FPC defines base classes as ansistring(0) parameters, and Lazarus wants to inherit and override them with a different type. This will clash.
> Why ansistring(0) for base classes? OS-level interfaces: yes, but why base classes?

This is the core problem. What solution will do for everybody (legacy, Lazarus, Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or TUnicodestring)? And what do we do if e.g. Lazarus changes opinion and goes from utf8 to utf16 on Windows (e.g. the Delphi/unicode becomes the dominant influx)? And do we really want Lazarus' direction to fix this for everybody? Or what if they bring in a new Kylix principle with utf16 base type?

I'm very reluctant to make a choice here and say "insert conversions if something changes". I would build in some flexibility and potential differentiation from the start. At least in principle. As said, we can see which combinations are popular at release time.

>> I've thought long and hard about this. Since the discussion what the dominant type should be won't stop anytime soon, and we probably will have to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as basetypes in the long run, plus a time ANSI as legacy, the RTL has to be prepared for it anyway, we might as well allow this on all platforms from the start. (actually releasing
Re: [fpc-devel] String and UnicodeString and UTF8String
Jonas Maebe wrote:

>> In current Delphi is String synonym for base type UnicodeString UTF-16
>> AFAIU ATM in FPC is String synonym for AnsiString (as in previous versions of Delphi)
>> Are there any plans to change meaning of String type ? (like Delphi to UnicodeString , or UTF8String?)
> If/when this is done, it will only be with a compiler switch or directive.

AFAIR Delphi doesn't offer such a compiler option, because units with different settings do not fit together (2 VCL versions would not be sufficient). I'm not sure about the details, but the Delphi designers certainly encountered problems that definitely forbid mixing string types. One such problem may be a slowdown due to many implicit string conversions, together with compiler warnings about possible losses on the conversion back to Ansi, and real losses as I observed in VB years ago. Another one may be the maintenance of duplicate (overloaded) procedures in the standard libraries.

When FPC implements two distinct versions, adding another Unicode/Ansi level to the unit output tree, both versions can be compiled from the same source code, possibly using conditional compilation where necessary.

For my part, I'd be happy with two definitely different Ansi and UTF(8) string types, with automatic conversion. But even then it would be wise to add another string parameter type, like Delphi RawByteString, that accepts both Ansi and UTF-8 arguments without implicit conversions.

DoDi
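The "real losses" on conversion back to Ansi are easy to reproduce. A minimal Python sketch, for illustration only (the sample word and the choice of Windows-1252 as the Ansi code page are assumptions): characters that the code page cannot represent are silently replaced, while a UTF-8 round trip keeps the text intact.

```python
original = "Łódź"                        # Ł and ź are not part of Windows-1252; ó is

# Implicit-conversion behaviour modelled with errors="replace":
# unrepresentable characters become '?' and cannot be recovered afterwards.
ansi = original.encode("cp1252", errors="replace")
round_trip = ansi.decode("cp1252")
print(round_trip)                        # ?ód?

# The same round trip through UTF-8 is lossless.
utf8_round_trip = original.encode("utf-8").decode("utf-8")
assert utf8_round_trip == original
assert round_trip != original
```

A parameter type that accepts the bytes as they are, without converting, avoids triggering this loss at every call boundary.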
Re: [fpc-devel] String and UnicodeString and UTF8String
On 10 Jan 2011, at 16:27, Marco van de Voort wrote:

> In our previous episode, Jonas Maebe said:
>> Why should a tstringlist force ansistring(0)?
> I mean that if you locally (for your units) set string=utf8string, TStringList still would be ansistring(0) or whatever the default becomes.

I meant: why not use ansistring($ffff) instead? You could even add a property to tstringlist that causes it to force the encoding of added strings to a particular code page whenever a string is added.

>> Or does Delphi force it to be that way?
> In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16), ansistring (+ variants) are for legacy only, and people try to forget shortstring as quickly as possible.

Then a unicodestring version is certainly required, and an ansistring($ffff) version would have to be called differently.

> I think in the planned Embarcadero cross-compile products, string will also be utf-16 on OS X and Linux. If only because it is (1) easier, and windows remains dominant by far (including UTF16 assuming codebases) (2) they plan to target QT.

I think it's a good decision to keep it the same everywhere, since string=unicodestring is not an opaque type in any way. As a result, choosing a different string type on other platforms would probably break lots of code again. And regardless of which toolkit you target on Mac OS X, conversions will probably happen anyway. The encoding used by Carbon and Cocoa is not specified anywhere afaik, and the CFString/NSString they are based on can use any encoding internally (I guess that's probably also UTF-16 for ease of processing).

> For me, having a mandatory UTF16 Unix is not an option, and a mandatory UTF8 Windows neither. (D2009+ incompatible)

I don't think UTF-16 everywhere would be a big problem.

>> Conversion may indeed be required for output (input would only pass on the encoding of the input if based on ansistring($ffff))
> ansistring(0), system encoding would be more logical than $ffff. $ffff is used more internally in string conversion routines and for strings that are not strings.

The fact that the formal return type is $ffff does not mean, afaik, that you also have to return something whose internal encoding is set to $ffff. It can still be an ansistring(0), ansistring(OEMSTRING) or whatever. It simply means that the encoding won't be forced to anything in particular when you assign a value to the function result. If you then assign this function result to another variable (which may have a forced encoding), then a conversion will happen if the forced encoding is different from the actual one. If you assign it to another ansistring($ffff), no encoding change will happen in any case, and the destination string will inherit the source's encoding.

> But what does that mean on Windows, where the console encoding is OEMSTRING and not ansistring(0) ?

As I said: ansistring($ffff).

>> but I think doing that only when necessary at the lowest level should be no problem. Many existing frameworks work that way.
> It touches all places where you touch the OS. But indeed one could try to split this by doing the classes utf8 or tunicodestring depending on OS.

I'm not sure why you say "indeed", because I did not propose to do that. I only proposed keeping as many RTL interfaces as possible in ansistring($ffff) to have something that's
a) generic, and
b) with the least chance of resulting in encoding conversion

However...

> And we have to deal with Windows, where the default is UTF16.

... since Delphi 2009 uses (unicode)string everywhere, we need at least also unicode versions.

>> Why ansistring(0) for base classes? OS-level interfaces: yes, but why base classes?
> This is the core problem. What solution will do for everybody (legacy,Lazarus,Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or TUnicodestring) ? And what do we do if e.g. Lazarus changes opinion and goes from utf8 to utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx). And do we really want Lazarus' direction to fixate this for everybody? Or what if they bring in a new Kylix principle with utf16 base type?

A unicodestring version for Delphi-compatibility, and if required an ansistring($ffff) version for all other purposes (afaik that would also work with legacy ansistring=ansistring(0), although it's not yet clear to me what happens if you pass an empty ansistring(0) to a rawbytestring var-parameter -- is it still nil like with current ansistrings, or can you somehow extract its declared encoding?)

>> I agree that the RTL should work regardless of the used string encoding, but I don't see why a particular encoding should be enforced throughout the entire RTL rather than just using ansistring($ffff) almost everywhere.
> That only solves the 1-byte case.

It's true that you probably need a separate overloaded version for unicodestring (just like we currently also have separate
Re: [fpc-devel] String and UnicodeString and UTF8String
On Monday, 10. January 2011 16.27:19 Marco van de Voort wrote:

> And there are three such cases
> - normal FPC and Delphi 2007- code : ansistring(0)
> - Lazarus : ansistring=utf8
> - Delphi 2009+ UTF16.

- fpGUI: ansistring = utf-8
- MSEgui: existing FPC UnicodeString = utf-16

Martin