Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-14 Thread Hans-Peter Diettrich

Sven Barth wrote:


Widestring will also grind the application to a halt due to being COM based
on Windows.


How so?




WideString on Windows has no reference counting, thus every time a 
WideString is assigned it needs to be copied.


I'm not so sure of that. AFAIR the field exists, but it's unused or 
reserved for shared memory management.


Of course the requirement that a BSTR has to reside in shared memory 
discourages the use of exactly that type for string handling inside an 
application.


I only wanted to prevent the introduction of another UTF16String type, 
in addition to WideString, BSTR (WinAPI) and UnicodeString (Delphi). 
Conversion-wise WideString/BSTR and (other) UTF-16 strings are equivalent.



Nearly all Windows API functions only allow single-byte encodings or 
UTF-16. The only functions I'm aware of that can use UTF-8 encoding are 
the console input/output APIs (if the codepage is set to UTF-8) [and also 
the file I/O APIs, but they don't assume any encoding].


And the conversion functions of course (MBCStoWStr...).

DoDi

___
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-14 Thread Thaddy

On 14-1-2011 13:21, Hans-Peter Diettrich wrote:

Sven Barth wrote:


Widestring will also grind the application to a halt due to being COM based
on Windows.


How so?




WideString on Windows has no reference counting, thus every time a 
WideString is assigned it needs to be copied.


I'm not so sure of that. AFAIR the field exists, but it's unused or 
reserved for shared memory management.



Yes, if you use the set of memory allocators I mentioned, the field 
*will* be used: COM marshalling.
No COM, no count. Simple as that. It is unused because the memory 
manager doesn't use it.
COM is not implemented unless you use a COM-based memory manager. No 
COM, no reference count.






Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Martin Schreiber
On Wednesday, 12. January 2011 23.05:02 Juha Manninen wrote:
 Martin Schreiber wrote on Monday, 10 January 2011 19:22:49:
  On Monday, 10. January 2011 16.27:19 Marco van de Voort wrote:
   And there are three such cases
  
   - normal FPC and Delphi 2007- code :  ansistring(0)
   - Lazarus : ansistring=utf8
   - Delphi 2009+  UTF16.
 
  - fpGUI: ansistring = utf-8
  - MSEgui: existing FPC UnicodeString = utf-16

 Without studying your code myself I guess you had to make many utility
 functions and classes yourself for UTF-16 ?
 Even the normal TStringList doesn't work.

Correct. MSEgui has a complete development environment for UnicodeString with 
its own set of lists, streams, file and directory functions and the like.

Martin


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Marco van de Voort
In our previous episode, Hans-Peter Diettrich said:
  non-native strings, it can also be a performance win).
  IMO a single encoding, i.e. UTF-8, can cover all cases.
  
  Well, for starters, it doesn't cover the existing Delphi/unicode codebase.
 
 Because it's bound to UTF-16? That's not a problem, because WideString 
 will continue to exist, and according conversions are still inserted by 
 the compiler.

That is DIY compatibility, or, in other words, no compatibility.

Widestring will also grind the application to a halt due to being COM based
on Windows.
 
  While some hard core Ansi coders may whine about such a convention, the
  absence of implicit string conversions (except in external library calls)
  will make such applications more performant than mixed-encoding versions.
  
  I don't see why this is the case. A current system encoding application does
  not do any conversion (except for GUI output, and that can be considered
  negligible compared to the actual GUI overhead).
 
 When system encoding changes with the target platform, indexed access to 
 such strings can lead to different results. Unless the compiler can read 
 the coder's mind...

You don't have to. The Delphi model provides a stringtype for the system
encoding, and then as such all strings from the system can be labeled. With
other stringtypes, the necessary conversions can be edited.

Likewise, e.g. win32 console routines can be labeled with OEMString. (Since
windows uses a different default encoding for the console)
 
  Why spend time in the design of multiple RTL/LCL versions, when 
  a single version will be perfectly sufficient?
  
  Why spend 13 years being compatible when you can throw it away in a
  second?
 
 It's sufficient to throw away what's no longer needed :-)

The previous message from Jeff shows that even shortstring is still in major
production use. Nothing is unused and can be clipped without a long winded
transition, or Delphi 2009 like painful breaks.

Moreover, these discussions are useless since you know as well as I do that
no one stringtype will ever satisfy everybody. So IMHO it is time to draw
the consequences from the 500 posts on the unicode subject on this and
other FPC/Lazarus lists and start thinking about solutions to manage that,
instead of reiterating the one type to rule them all mantra
ad infinitum.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Hans-Peter Diettrich

Marco van de Voort wrote:

In our previous episode, Hans-Peter Diettrich said:

non-native strings, it can also be a performance win).

IMO a single encoding, i.e. UTF-8, can cover all cases.

Well, for starters, it doesn't cover the existing Delphi/unicode codebase.
Because it's bound to UTF-16? That's not a problem, because WideString 
will continue to exist, and according conversions are still inserted by 
the compiler.


That is DIY compatibility, or, in other words, no compatibility.


I still don't understand the problem :-(


Widestring will also grind the application to a halt due to being COM based
on Windows.


How so?


When system encoding changes with the target platform, indexed access to 
such strings can lead to different results. Unless the compiler can read 
the coder's mind...


You don't have to. The Delphi model provides a stringtype for the system
encoding, and then as such all strings from the system can be labeled. With
other stringtypes, the necessary conversions can be edited.


Indexed string access produces different results for Ansi and UTF-8 system 
encoding. Such code is not portable, and neither is the data (ini files). 
Allowing for UTF-8 as the system encoding will frustrate Windows 
users (dunno whether Windows allows for such a system encoding), and 
Linux users are frustrated when UTF-8 is disallowed.


Only solution: using OS encoding restricts the code to run on a single 
machine only, or on similarly configured machines.


The group of users, which accept this restriction, will be happy with a 
single AnsiString type and no implicit conversions. Without implicit 
conversions such a string type can hold UTF-8 as well.




Likewise, e.g. win32 console routines can be labeled with OEMString. (Since
windows uses a different default encoding for the console)


This either implies OEM encoding as the system encoding of Win32 console 
applications, or the use of multiple codepages, as before. But IMO Win32 
console also implements a W interface, so that it's up to the user to 
use whatever is more appropriate for his code.


The RTL has to distinguish between system-wide filesystem and GUI 
encoding, in file handling (CreateFile...).



Why spend time in the design of multiple RTL/LCL versions, when 
a single version will be perfectly sufficient?

Why spend 13 years being compatible when you can throw it away in a
second?

It's sufficient to throw away what's no longer needed :-)


The previous message from Jeff shows that even shortstring is still in major
production use. Nothing is unused and can be clipped without a long winded
transition, or Delphi 2009 like painful breaks.


It's all about the well known dilemma:
- force (possibly many) implicit conversions, or
- supply multiple RTL/LCL versions, or
- break legacy user code by moving to a different (but again unique) 
string type.



Moreover, these discussions are useless since you know as well as I do that
no one stringtype will ever satisfy everybody. So IMHO it is time to draw
the consequences from the 500 posts on the unicode subject on this and
other FPC/Lazarus lists and start thinking about solutions to manage that,
instead of reiterating the one type to rule them all mantra
ad infinitum.


The discussion is only about the pros and cons of the various possible 
solutions. I.e. it should reveal the critical cases and consequences, 
that have to be considered and handled in every implementation.


The implementation can choose any model. Different models can be 
implemented as well, so that the final decision about the new standard 
can be delayed, until the models can be tested in real world applications.


One model has already been implemented: UTF-8. It may need some 
additions/improvements, like a *hard* separation of AnsiString from 
UTF8String, and nothing has to be thrown away.


DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Sven Barth

On 12.01.2011 22:40, Marco van de Voort wrote:

In our previous episode, Sven Barth said:

legacy code can be broken by (eventually) required changes to set of
char, sizeof(char) and PChar, sizeof(string) as opposed to
Length(string), upper/lower conversion, and many more not so obvious
consequences.


I don't believe that PChar will be touched, because too much code that
interfaces with C code depends on that. Although its declaration might
not be the same then and become PChar = PAnsiChar instead of PChar =
^Char if Char is changed (currently it's PAnsiChar = PChar).


Current Delphi _does_ regard char as the low-level type equivalent to string. So
whatever you choose as string (8 or 16-bit), pchar will match it by changing
to pansichar or pwidechar.


Oh come on -.-

There are some days on which I really dislike the developers of Delphi...

Regards,
Sven


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Sven Barth

On 13.01.2011 18:57, Hans-Peter Diettrich wrote:

Widestring will also grind the application to a halt due to being COM based
on Windows.


How so?




WideString on Windows has no reference counting, thus every time a 
WideString is assigned it needs to be copied.



When system encoding changes with the target platform, indexed access
to such strings can lead to different results. Unless the compiler
can read the coder's mind...


You don't have to. The Delphi model provides a stringtype for the system
encoding, and then as such all strings from the system can be labeled. With
other stringtypes, the necessary conversions can be edited.


Indexed string access produces different results for Ansi and UTF-8 system
encoding. Such code is not portable, and neither is the data (ini files).
Allowing for UTF-8 as the system encoding will frustrate Windows
users (dunno whether Windows allows for such a system encoding), and
Linux users are frustrated when UTF-8 is disallowed.



Nearly all Windows API functions only allow single-byte encodings or 
UTF-16. The only functions I'm aware of that can use UTF-8 encoding are 
the console input/output APIs (if the codepage is set to UTF-8) [and also 
the file I/O APIs, but they don't assume any encoding].


Regards,
Sven


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Thaddy

On 13-1-2011 21:40, Sven Barth wrote:


WideString on Windows has no reference counting, thus every time a 
WideString is assigned it needs to be copied.

Not exactly true. WideString is COM marshaled and thus has reference 
counting on the COM level, AFAIK.
As long as your memory manager is COM marshaled too, that is. And since 
most Pascal memory manager versions do not support COM directly, it goes 
wrong in a big way.
I once wrote a simple COM memory manager to test this. Performance stays 
sh*t, but strings seem to be counted, not copied.
If you use CoTaskMemAlloc, CoTaskMemFree and CoTaskMemRealloc in your 
memory manager you will see what I mean.

At least it comes close, but slow it will stay.
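
The copy-versus-count question can be checked empirically. What follows is a minimal sketch, assuming Free Pascal in objfpc mode; the program and variable names are made up for illustration. It compares the data pointers of two string variables after an assignment: equal pointers mean the buffer is shared (reference counted), different pointers mean the data was copied. AnsiString should report TRUE. For WideString the thread's claim predicts FALSE on Windows; on other targets the result depends on how the RTL implements WideString there, which is an assumption not settled in this thread.

```pascal
program WideCopyDemo;
{$mode objfpc}{$H+}
var
  A1, A2: AnsiString;
  W1, W2: WideString;
begin
  // StringOfChar allocates on the heap, so A1 is not a compile-time constant.
  A1 := StringOfChar('x', 16);
  A2 := A1;
  // Reference-counted types share one buffer after an assignment.
  WriteLn('AnsiString shares buffer: ', Pointer(A1) = Pointer(A2));

  SetLength(W1, 16);
  W1[1] := 'x';
  W2 := W1;
  // If the assignment had to copy the data, the two pointers differ.
  WriteLn('WideString shares buffer: ', Pointer(W1) = Pointer(W2));
end.
```

This only shows whether an assignment copies; it says nothing about the cost of the allocator behind it, which is the other half of the performance argument.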




Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-13 Thread Martin Schreiber
On Thursday, 13. January 2011 18.57:00 Hans-Peter Diettrich wrote:

 The implementation can choose any model. Different models can be
 implemented as well, so that the final decision about the new standard
 can be delayed, until the models can be tested in real world applications.

 One model has already been implemented: UTF-8. It may need some
 additions/improvements, like a *hard* separation of AnsiString from
 UTF8String, and nothing has to be thrown away.

Another already implemented model is UTF-16 UnicodeString in MSEgui. It needs no 
changes in the Free Pascal compiler.

Martin


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Michael Schnell

On 01/11/2011 05:50 PM, Hans-Peter Diettrich wrote:


Since the generic Delphi string type can be any Unicode encoding now,


This ("From what I read I understand
that the dynamically coded string type can hold 1, 2, and 4 byte (maybe
even more) codes for its elements (denoted in one control value) and
each of those (theoretically) in different coding schemes (denoted in
another control value), allowing e.g. for UTF-8, UTF-16, UCS4, German
ANSI, raw byte, string") is what I (not owning a Delphi > 2007) thought, 
too, and have been bashed for.


But the document "Delphi and Unicode" by Marco Cantu ( 
http://edn.embarcadero.com/article/images/38980/Delphi_and_Unicode.pdf 
), dated Nov. 2008, in fact states:


length, the second element is the reference count. In Delphi 2009 the 
representation for reference-counted strings becomes:

  Offset:  -12        -10        -8         -4      String reference address
  Field:   Code page  Elem size  Ref count  Length  First char of string

Beside the length and reference count, the new fields represent the 
element size and the code page. While the element size is used to 
discriminate between AnsiString and UnicodeString, the code page makes 
sense in particular for the AnsiString type (as it works in Delphi 2009), 
as the UnicodeString type has the fixed code page 1200.

A corresponding support data structure is declared in the implementation 
section of System unit as:

type
  PStrRec = ^StrRec;
  StrRec = packed record
    codePage: Word;
    elemSize: Word;
    refCnt: Longint;
    length: Longint;
  end;

But maybe the document is outdated.
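
The two new header fields can be inspected without poking at negative offsets. The sketch below assumes a compiler that provides the System routines StringCodePage and StringElementSize (Delphi 2009 and later; for Free Pascal this is an assumption that holds only for releases with code-page-aware strings, not for the versions current in this thread). The program name and variables are illustrative only. Going by the quoted document, the UnicodeString line should report code page 1200 and element size 2, and a UTF-8 string should report 65001 and 1.

```pascal
program HeaderDemo;
{$mode objfpc}{$H+}
var
  U: UnicodeString;
  A: UTF8String;
begin
  U := 'abc';
  A := UTF8Encode(U);   // same text, stored as UTF-8 bytes
  // Each line prints the code page followed by the element size in bytes.
  WriteLn('UTF8String:    ', StringCodePage(A), ' ', StringElementSize(A));
  WriteLn('UnicodeString: ', StringCodePage(U), ' ', StringElementSize(U));
end.
```

Reading the fields through these routines avoids depending on the exact record layout, which may differ between compilers and versions.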

-Michael



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Michael Schnell

On 01/11/2011 05:19 PM, Hans-Peter Diettrich wrote:


IMO a single encoding, i.e. UTF-8, can cover all cases.


Of course you are right here, but there are some things to be considered:

In Windows (and maybe elsewhere, too) a two-Byte API (e.g. UTF-16) needs 
to be used, forcing lots of conversions when doing GUI applications.


_All_ beginners will use s[i] and expect to get a character without any 
afterthought. If they are not working in English, they will be very 
disappointed when they get bytes instead of characters. The count of the 
frustrated will be much smaller (but not zero) when doing 
Widestring/Widechar and they get words instead of characters.


Eliminating the s[i] syntax would trash a lot of legacy code, and the 
decent replacement (finding the correct character and moving it into a 
DWord in UCS4) is slow and still does not handle all the funny Unicode 
character-combining stuff. But the count of frustrated beginners might 
be further reduced.
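
To make the s[i] problem concrete, here is a minimal sketch, assuming Free Pascal in objfpc mode (program and variable names are illustrative). It builds the text "a", Euro sign, "b" from character constants, so the result does not depend on the encoding of the source file, and then indexes position 2 in both the UTF-16 and the UTF-8 representation.

```pascal
program IndexDemo;
{$mode objfpc}{$H+}
var
  U: UnicodeString;   // UTF-16 code units
  S: UTF8String;      // UTF-8 bytes
begin
  U := 'a';
  U := U + WideChar($20AC) + 'b';   // U+20AC is the Euro sign
  S := UTF8Encode(U);

  WriteLn('UTF-16 code units: ', Length(U));   // 3
  WriteLn('UTF-8 bytes:       ', Length(S));   // 5

  // The same index yields different things in the two encodings.
  WriteLn('U[2] = $', HexStr(LongInt(Ord(U[2])), 4));   // 20AC, the whole character
  WriteLn('S[2] = $', HexStr(LongInt(Ord(S[2])), 2));   // E2, only its first byte
end.
```

With UTF-16 the index happens to hit a complete character here because the Euro sign lies in the BMP; a character outside the BMP would occupy two code units and show the same effect as the UTF-8 case.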


-Michael


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Hans-Peter Diettrich

Jeff Wormsley wrote:

On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:


UTF-8 combines a single (byte-based) storage type with lossless 
encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* 
easier to handle in user code, but both will fail and require special 
code whenever characters outside the assumed codepage may occur.


Preface: I don't write international apps, and probably won't for the 
foreseeable future...


Then you may be bound to some legacy compiler version when the 
string handling changes at some future time, as happened to Delphi 
users. Continued support of AnsiString type(s) is not enough, because 
legacy code can be broken by (possibly) required changes to set of 
char, sizeof(char) and PChar, sizeof(string) as opposed to 
Length(string), upper/lower conversion, and many more not so obvious 
consequences.


Isn't all of this concentration on trying to make strings have single 
byte characters (who cares how they are encoded), using the argument 
that it is somehow faster, incorrect for just about any modern 
processor, including embedded CPU's such as ARM?  It was my 
understanding that 32 bit aligned access was always faster than byte 
aligned access on just about any CPU FPC still supports.


See Marco's comment about data size etc.

The argument holds just fine for memory, but I don't really get the 
speed argument.  Maybe I'm missing something.


FPC (the compiler) still uses ShortStrings wherever possible, because 
that was found to be the most efficient string representation. This is 
partially due to the ASCII encoding of source code, except for string 
literals. But like you, I'm not sure that this argument still holds on 
modern hardware.


Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)

[1] All encodings of variable character size discourage indexed access 
to strings. Then char can have multiple meanings, representing either 
the (physical) string/array *element* size or the (logical) size of a 
*codepoint*. Until now most users, including you, most probably don't 
realize that difference between physical and logical characters, and 
assume that sizeof(char) always is 1, and possibly that 
sizeof(WideChar) is 2. IMO variables of type char should have at 
least 3 (better 4) bytes in a Unicode environment, in order to maintain 
the correspondence between physical and logical characters. As already 
suggested, the packed keyword could be applied to strings and char 
arrays, to definitely signal to the user that indexed access should not 
be used with such variables, unless a speed penalty is acceptable.
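
The difference between physical elements and logical characters can be sketched as follows, assuming Free Pascal in objfpc mode. CodePointCount is a hypothetical helper written for this illustration, not an RTL routine, and it assumes its input is valid UTF-8. It counts code points by skipping continuation bytes, which all have the bit pattern 10xxxxxx.

```pascal
program CodePointDemo;
{$mode objfpc}{$H+}

// Hypothetical helper: number of code points in a valid UTF-8 string.
// Every byte that is not a continuation byte starts a new code point.
function CodePointCount(const S: UTF8String): SizeInt;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

var
  U: UnicodeString;
  S: UTF8String;
begin
  U := 'a';
  U := U + WideChar($20AC) + 'b';   // "a", Euro sign, "b"
  S := UTF8Encode(U);
  WriteLn('physical elements (bytes):        ', Length(S));          // 5
  WriteLn('logical characters (code points): ', CodePointCount(S));  // 3
end.
```

The linear scan is the speed penalty referred to above: finding the n-th logical character in a variable-width encoding cannot be done by plain indexing.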


DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Sven Barth

On 12.01.2011 13:38, Hans-Peter Diettrich wrote:

Jeff Wormsley wrote:

On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:


UTF-8 combines a single (byte-based) storage type with lossless
encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look*
easier to handle in user code, but both will fail and require special
code whenever characters outside the assumed codepage may occur.


Preface: I don't write international apps, and probably won't for the
foreseeable future...


Then you may be bound to some legacy compiler version when the
string handling changes at some future time, as happened to Delphi
users. Continued support of AnsiString type(s) is not enough, because
legacy code can be broken by (possibly) required changes to set of
char, sizeof(char) and PChar, sizeof(string) as opposed to
Length(string), upper/lower conversion, and many more not so obvious
consequences.


I don't believe that PChar will be touched, because too much code that 
interfaces with C code depends on that. Although its declaration might 
not be the same then and become PChar = PAnsiChar instead of PChar = 
^Char if Char is changed (currently it's PAnsiChar = PChar).


Regards,
Sven


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Marco van de Voort
In our previous episode, Hans-Peter Diettrich said:
  memory management and the occasional code page conversion (and since 
  this may reduce the number of code page conversions when working with 
  non-native strings, it can also be a performance win).
 
 IMO a single encoding, i.e. UTF-8, can cover all cases.

Well, for starters, it doesn't cover the existing Delphi/unicode codebase.

 While some hard core Ansi coders may whine about such a convention, the
 absence of implicit string conversions (except in external library calls)
 will make such applications more performant than mixed-encoding versions.

I don't see why this is the case. A current system encoding application does
not do any conversion (except for GUI output, and that can be considered
negligible compared to the actual GUI overhead).
 
 Why spend time in the design of multiple RTL/LCL versions, when 
 a single version will be perfectly sufficient?

Why spend 13 years being compatible when you can throw it away in a second?


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Marco van de Voort
In our previous episode, Sven Barth said:
  legacy code can be broken by (eventually) required changes to set of
  char, sizeof(char) and PChar, sizeof(string) as opposed to
  Length(string), upper/lower conversion, and many more not so obvious
  consequences.
 
 I don't believe that PChar will be touched, because too much code that 
 interfaces with C code depends on that. Although its declaration might 
 not be the same then and become PChar = PAnsiChar instead of PChar = 
 ^Char if Char is changed (currently it's PAnsiChar = PChar).

Current Delphi _does_ regard char as the low-level type equivalent to string. So
whatever you choose as string (8 or 16-bit), pchar will match it by changing
to pansichar or pwidechar.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread DaWorm
On Wed, Jan 12, 2011 at 7:38 AM, Hans-Peter Diettrich
drdiettri...@aol.com wrote:

 Until now most users, including you, most probably don't
 realize that difference between physical and logical characters, and assume
 that sizeof(char) always is 1

Oh, I'm aware of it.  But to date, I haven't had to really deal with
it in Delphi or FPC. My use of strings is either ancient legacy (from
TP/BP days) where I simply changed all references to string to
shortstring or low level Windows API code, where I'm dealing with
PChar.

I find these discussions fascinating, but as they say in the southern
US, I don't have a dog in this hunt.  Whatever the decision, I'll
probably continue to use shortstring.

Jeff.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-12 Thread Hans-Peter Diettrich

Marco van de Voort wrote:

In our previous episode, Hans-Peter Diettrich said:
memory management and the occasional code page conversion (and since 
this may reduce the number of code page conversions when working with 
non-native strings, it can also be a performance win).

IMO a single encoding, i.e. UTF-8, can cover all cases.


Well, for starters, it doesn't cover the existing Delphi/unicode codebase.


Because it's bound to UTF-16? That's not a problem, because WideString 
will continue to exist, and according conversions are still inserted by 
the compiler.



While some hard core Ansi coders may whine about such a convention, the
absence of implicit string conversions (except in external library calls)
will make such applications more performant than mixed-encoding versions.


I don't see why this is the case. A current system encoding application does
not do any conversion (except for GUI output, and that can be considered
negligible compared to the actual GUI overhead).


When system encoding changes with the target platform, indexed access to 
such strings can lead to different results. Unless the compiler can read 
the coder's mind...


Why spend time in the design of multiple RTL/LCL versions, when 
a single version will be perfectly sufficient?


Why spend 13 years being compatible when you can throw it away in a second?


It's sufficient to throw away what's no longer needed :-)

DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Michael Schnell

On 01/10/2011 04:27 PM, Marco van de Voort wrote:


And what do we do if e.g. Lazarus changes opinion and goes from utf8 to
utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).

The current way Lazarus works (UTF-8 in a string type called 
ANSIString, on Windows as well as on Linux, without any 
auto-conversion, introducing funny problems e.g. when just assigning a 
string constant to a Widestring) does not seem very appropriate.


I feel the logical move would be to use the dynamically encoded string 
type in the LCL API, but there might be some nasty hidden problems (e.g. 
with var parameters).


-Michael


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Michael Schnell

On 01/10/2011 04:27 PM, Marco van de Voort wrote:

I think in the planned Embarcadero cross-compile products, string will also
be utf-16 on OS X and Linux.

Yak,

I had hoped that using the dynamically encoded string type nearly 
everywhere would allow for a great deal of non-OS-specific code in the 
VCL (and LCL), without the need for excessive conversions, while keeping 
the system's encoding (UTF-16 or UTF-8) in and out of GUI-centric user code.


I thought this would have been the main reason for introducing the 
additional complexity of the dynamically encoded string type.


-Michael


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread LacaK



I think at most two are required for any target: unicodestring (D2009 
compatibility), and if really necessary because somehow the unicodestring 
version causes too much overhead, an ansistring($) version as well. That's 
only for the classes though, I think most of the base RTL can be simply 
ansistring($).
  
So if I understand correctly, the UnicodeString and also the AnsiString 
types must be extended so that they also hold information about the 
actual codepage (encoding) of the string data they hold.
(AFAIK ATM they hold only information about reference count and size 
and of course the data.)


I am not an expert, so I do not understand all the aspects/problems 
associated with proper string handling, but some kind of implicit 
conversion (based on the actual encoding of the string data) is necessary 
(ANSI - UTF-8 - UTF-16 - ANSI ... etc.).


One example is the known problem with the Euro currency symbol. On Windows 
it is stored in the CurrencyString global variable using the ANSI codepage, 
but it is used in the LCL (which expects UTF-8 encoding) without any 
explicit conversion, which leads to displaying ? instead of € (for example 
in TDBEdit or TDBGrid).


Another problem occurs when displaying character data in data-aware 
database controls (TDBEdit, TDBGrid). Data-aware controls (LCL) read data 
from TField descendants (FCL) using the TField.Text property, which returns 
string (without codepage information it is not clear whether it is 
AnsiString or UTF8String or UnicodeString). The LCL expects UTF-8 strings, 
but that is not true in all cases (for example in the case of ODBC).


-Laco.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Hans-Peter Diettrich

Marco van de Voort wrote:


Btw, while looking up rawbytestring I saw this in the Delphi help:

Declaring variables or fields of type RawByteString should rarely, if ever,
be done, because this practice can lead to undefined behavior and potential
data loss.


IIRC RawByteString should be used like OpenString, as a subroutine 
argument type only. In contrast to the name, a RawByteString has a 
variable encoding, i.e. implicit conversions are inserted for every use 
with other string types. Thus AnyByteString would have been a better name 
for that type, IMO. Delphi no longer (officially) supports non-textual 
data in strings, and TBytes should be used for such data.




How will you deal with e.g. Windows? Legacy string=ansistring(0), D2009 is
string=utf16 TUnicodestring?


Is a Delphi UnicodeString really compatible with a WinAPI 
WideString/BSTR? AFAIR all BSTRs must reside in shared memory, so that 
copies are required for every API call.




Mainly the question what the classtree will be. The main operating type used
in applications.  You always need two RTLs for that, since it can be 1 or 2
byte, and even if you fixated it on one byte encodings, rawbytestring would
force you to write case statements in each and every procedure.


UTF-8 combines a single (byte-based) storage type with lossless 
encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* 
easier to handle in user code, but both will fail and require special 
code whenever characters outside the assumed codepage may occur.


DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Hans-Peter Diettrich

Jonas Maebe wrote:

This has the advantage that you always have all optimal implementations 
available, regardless of the platform or default string encoding. It 
does not require extra work because we have to write all those versions 
also if we want the RTL to be compilable for different default string 
encodings. And three checks in a case statement are not going to define 
the performance in a context of atomic reference counting, dynamic 
memory management and the occasional code page conversion (and since 
this may reduce the number of code page conversions when working with 
non-native strings, it can also be a performance win).


IMO a single encoding, i.e. UTF-8, can cover all cases. While some hard 
core Ansi coders may whine about such a convention, the absence of 
implicit string conversions (except in external library calls) will make 
such applications more performant than mixed-encoding versions.


The argument "my characters will *always* be inside my preferred 
codepage" will prove false sooner or later. While it's not up to a 
programming language to teach people the better way of coding, the 
required efforts of the FPC/Lazarus developers IMO should have more 
weight. Why spend time in the design of multiple RTL/LCL versions, when 
a single version will be perfectly sufficient?


DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Hans-Peter Diettrich

Jonas Maebe schrieb:


And we have to deal with Windows, where the default is UTF16.


... since Delphi 2009 uses (unicode)string everywhere, we need at least also 
unicode versions.


Since the generic Delphi string type can be any Unicode encoding now, 
it IMO would be legal to use UTF-8 or UTF-32 for it internally, in FPC. 
Some code, expecting UCS2/BMP text only, may become a bit slower due to 
the conversions needed for indexed access to chars, but no other 
*implicit* conversions will ever occur. Likewise the generic char type 
could become a 32 bit type, so that it can hold *every* Unicode codepoint.


For both string and array of char the packed keyword could be used 
to distinguish between different bytecount and encoding, where unpacked 
types contain UTF-32 chars. This would speed up user code with indexed 
access, in contrast to both UTF-8 and -16 encodings, and it would allow 
the user to optimize his code for either speed or size. Indexed access 
to packed types simply could be disallowed, without breaking anything 
since the default is not packed.
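
To make the indexed-access trade-off concrete, here is a rough sketch. It is not FPC code: Python byte buffers stand in for the packed (UTF-8) and unpacked (UTF-32) representations, indices are 0-based, the input is assumed to be valid, and all function names are invented for the example.

```python
def utf8_seq_len(lead):
    """Length in bytes of the UTF-8 sequence starting with this lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    return 4  # assumes valid UTF-8 input


def utf32_char_at(buf, index):
    """Constant time: every code point occupies exactly 4 bytes."""
    return buf[4 * index:4 * index + 4].decode("utf-32-le")


def utf8_char_at(buf, index):
    """Linear time: skip index sequences to find the start offset."""
    pos = 0
    for _ in range(index):
        pos += utf8_seq_len(buf[pos])
    return buf[pos:pos + utf8_seq_len(buf[pos])].decode("utf-8")


sample = "aé€😀z"
packed = sample.encode("utf-8")        # 11 bytes
unpacked = sample.encode("utf-32-le")  # 20 bytes
```

Both accessors return the same character for the same index; the difference is that the UTF-8 version has to walk the buffer, while the UTF-32 version pays for direct indexing with a larger buffer.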


Just some more ideas...

DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Jeff Wormsley

On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:


UTF-8 combines a single (byte-based) storage type with lossless 
encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* 
easier to handle in user code, but both will fail and require special 
code whenever characters outside the assumed codepage may occur.


Preface: I don't write international apps, and probably won't for the 
foreseeable future...


Isn't all of this concentration on trying to make strings have single 
byte characters (who cares how they are encoded), using the argument 
that it is somehow faster, incorrect for just about any modern 
processor, including embedded CPU's such as ARM?  It was my 
understanding that 32 bit aligned access was always faster than byte 
aligned access on just about any CPU FPC still supports.


The argument holds just fine for memory, but I don't really get the 
speed argument.  Maybe I'm missing something.


Jeff.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-11 Thread Marco van de Voort
In our previous episode, Jeff Wormsley said:
  encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* 
  easier to handle in user code, but both will fail and require special 
  code whenever characters outside the assumed codepage may occur.
 
 Preface: I don't write international apps, and probably won't for the 
 foreseeable future...
 
 Isn't all of this concentration on trying to make strings have single 
 byte characters (who cares how they are encoded), using the argument 
 that it is somehow faster, incorrect for just about any modern 
 processor, including embedded CPU's such as ARM?

  It was my 
 understanding that 32 bit aligned access was always faster than byte 
 aligned access on just about any CPU FPC still supports.

1-byte access is always 1-byte aligned, and the memory system is still
slower than these kinds of issues. And you shuffle a lot of extra zeroes
around.

But the trouble is also that the 2-byte situation doesn't really solve
anything (you still have surrogates, and it never will be as simple as it
was), and there is a much bigger problem with legacy (how much two-byte data
do you get daily, and how much one-byte?)

 The argument holds just fine for memory, but I don't really get the 
 speed argument.  Maybe I'm missing something.

Shoveling twice as much memory around IS the speed argument :-)
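
A quick way to see the memory side of this is to count bytes per encoding form. The sketch below is only an illustration in Python; the helper name encoded_sizes and the sample strings are assumptions, not anything from FPC or Delphi.

```python
def encoded_sizes(text):
    """Return the byte count of text in each Unicode encoding form."""
    return {enc: len(text.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}


ascii_sizes = encoded_sizes("Hello, world")  # typical one-byte legacy data
cjk_sizes = encoded_sizes("日本語")           # BMP text outside ASCII

for label, sizes in (("ascii", ascii_sizes), ("cjk", cjk_sizes)):
    print(label, sizes)
```

For the ASCII sample UTF-16 takes twice the bytes of UTF-8 and UTF-32 four times; for the CJK sample UTF-8 is the larger one, so the ratio depends on the kind of text being moved.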


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Michael Schnell

On 01/10/2011 09:12 AM, LacaK wrote:

In current Delphi, String is a synonym for the base type UnicodeString (UTF-16)

AFAIK, in current Delphi (which I don't have) a String is a variable 
that can contain dynamically coded information (such as locally coded 
8-Bit ANSI, UTF-8, UTF-16, ...) and - of course - knows which encoding 
it holds.


If a string is generated by the VCL from a Windows API function, the 
encoding will be UTF-16, but if you create a string with some other 
encoding it will be automatically re-coded to UTF-16 before it is 
passed to a Windows API function.


-Michael


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread LacaK


AFAIK, in current Delphi (which I don't have) a String is a variable 
that can contain dynamically coded information (such as locally 
coded 8-Bit ANSI, UTF-8, UTF-16, ...) and - of course - know which 
code it holds.

I understand "By default, variables declared as type String are 
UnicodeString" to mean that String=UnicodeString.

See: http://docwiki.embarcadero.com/VCL/en/System.UnicodeString
and also 
http://docwiki.embarcadero.com/RADStudio/en/String_Types#UnicodeString


Note also that AnsiString holds additional information about the character 
encoding:
The AnsiString (http://docwiki.embarcadero.com/VCL/en/System.AnsiString) 
structure contains a 32-bit length indicator, a 32-bit reference count, a 
16-bit data length indicating the number of bytes per character, and a 
16-bit code page.
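
As a rough picture of such a header, the four fields add up to 12 bytes. The sketch below packs them with Python's struct module; the field order (code page, bytes per character, reference count, length), the little-endian layout and all names are assumptions made for illustration, since the quoted description only lists the field sizes.

```python
import struct

# Assumed order and layout (not stated in the quoted description):
# 16-bit code page, 16-bit bytes per character,
# 32-bit reference count, 32-bit length; little-endian, no padding.
HEADER = struct.Struct("<HHii")

CP_UTF8 = 65001  # Windows code page number for UTF-8


def pack_header(code_page, elem_size, ref_count, length):
    """Build the 12-byte header that would precede the character data."""
    return HEADER.pack(code_page, elem_size, ref_count, length)


def unpack_header(raw):
    """Decode a 12-byte header back into named fields."""
    code_page, elem_size, ref_count, length = HEADER.unpack(raw)
    return {"code_page": code_page, "elem_size": elem_size,
            "ref_count": ref_count, "length": length}


payload = "Grüße".encode("utf-8")
header = pack_header(CP_UTF8, 1, 1, len(payload))
fields = unpack_header(header)
```

The point is only that the encoding travels with the string data, so a routine can inspect it at run time instead of assuming one encoding.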


-Laco.



If a string is generated by the VCL from a Window API function, the 
coding will be UTF-16, though, but if you create a string with some 
other coding it will be automatically re-coded to UTF16 before sending 
it into a Windows API function.


-Michael


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Jonas Maebe


On 10 Jan 2011, at 09:12, LacaK wrote:


In current Delphi is String synonym for base type UnicodeString UTF-16
AFAIU ATM in FPC is String synonym for AnsiString (as in previous  
versions of Delphi)

Are there any plans to change meaning of String type ?
(like Delphi to UnicodeString , or UTF8String?)


If/when this is done, it will only be with a compiler switch or  
directive.


Are there any plans to introduce implicit conversions between  
AnsiStrings (ANSI code page) to UTF8Strings (UTF-8 encoded) or  
something like this ?


That would be part of the general D2009 ansistring support you  
referred to in your other message. There is an svn branch (cpstrnew)  
that contains some preliminary work for this functionality, but nobody  
has worked on it for a long time. Developers interested in working on  
finishing that functionality are welcome!



Jonas


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Marco van de Voort
In our previous episode, Jonas Maebe said:
  In current Delphi is String synonym for base type UnicodeString UTF-16
  AFAIU ATM in FPC is String synonym for AnsiString (as in previous  
  versions of Delphi)
  Are there any plans to change meaning of String type ?
  (like Delphi to UnicodeString , or UTF8String?)
 
 If/when this is done, it will only be with a compiler switch or  
 directive.

(
That won't be enough, since that would not change the relevant units and
classes to such type. (e.g. tstringlist would remain defined ansistring)

For this to work, we probably will have to split targets into UTF16 and
ansi. (and maybe multiple ansi's for some platforms)
) 


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Jonas Maebe


On 10 Jan 2011, at 13:33, Marco van de Voort wrote:


In our previous episode, Jonas Maebe said:
In current Delphi is String synonym for base type UnicodeString  
UTF-16

AFAIU ATM in FPC is String synonym for AnsiString (as in previous
versions of Delphi)
Are there any plans to change meaning of String type ?
(like Delphi to UnicodeString , or UTF8String?)


If/when this is done, it will only be with a compiler switch or
directive.


(
That won't be enough, since that would not change the relevant units  
and
classes to such type. (e.g. tstringlist would remain defined  
ansistring)


If it's a D2009-style ansistring, does that matter?


Jonas


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Marco van de Voort
In our previous episode, Jonas Maebe said:
 
  If/when this is done, it will only be with a compiler switch or
  directive.
 
  (
  That won't be enough, since that would not change the relevant units  
  and
  classes to such type. (e.g. tstringlist would remain defined  
  ansistring)
 
 If it's a D2009-style ansistring, does that matter?

A lot of conversion, since it will use ansistring(0) so reading/writing
ansistring(cp_utf8) will force conversions. (0 means system encoding, $
means never convert)

Besides that the usual three problems:

- I don't know how VAR behaves in this case (passing an ansistring(cp_utf8)
   to a var ansistring(0) parameter).
- maybe overloading (only corner cases?) etc.
- inheritance. FPC defines base classes with ansistring(0) parameters, and
   Lazarus wants to inherit and override them with a different type. This will 
   clash.

I've thought long and hard about this. Since the discussion about what the
dominant type should be won't stop anytime soon, and we probably will have
to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as basetypes in
the long run, plus, for a time, ANSI as legacy, the RTL has to be prepared
for it anyway, so we might as well allow this on all platforms from the start. 
(actually releasing them is a different question and depends on manpower)

That doesn't mean that a per unit switch is useless, but I think a target
switch to fixate the bulk of the cases will save both us and the users a lot
of grief.


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Jonas Maebe


On 10 Jan 2011, at 13:57, Marco van de Voort wrote:


In our previous episode, Jonas Maebe said:


If/when this is done, it will only be with a compiler switch or
directive.


(
That won't be enough, since that would not change the relevant units
and
classes to such type. (e.g. tstringlist would remain defined
ansistring)


If it's a D2009-style ansistring, does that matter?


A lot of conversion, since it will use ansistring(0) so reading/ 
writing
ansistring(cp_utf8) will force conversions. (0 means system  
encoding, $

means never convert)


Why should a tstringlist force ansistring(0)? Or does Delphi force it  
to be that way?


Conversion may indeed be required for output (input would only pass on  
the encoding of the input if based on ansistring($)), but I think  
doing that only when necessary at the lowest level should be no  
problem. Many existing frameworks work that way.



Besides that the usual three problems:

- I  don't know how VAR behaves in this case. (passing a  
ansistring(cp_utf8) to a var ansistring(0) parameter),


var-parameters may indeed pose a problem in case some parameters of OS- 
neutral routines are required to have a particular encoding specified.



- maybe overloading (only cornercases?) etc.


Possibly, although I guess there are probably rules for that (whether  
they are documented is another matter though, probably...)


- inheritance. FPC defines base classes as ansistring(0) parameters,  
and
  Lazarus wants to inherit and override them with a different type.  
This will clash.


Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
base classes?



I've thought long and hard about this. Since the discussion what the
dominant type should be won't stop anytime soon, and we probably  
will have
to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as  
basetypes in
the long run, plus a time ANSI as legacy, the RTL has to be prepared  
for it

anyway, we might as well allow this on all platforms from the start.
(actually releasing them is a different question and depends on  
manpower)


I agree that the RTL should work regardless of the used string  
encoding, but I don't see why a particular encoding should be enforced  
throughout the entire RTL rather than just using ansistring($)  
almost everywhere.


I also agree that we should strive to minimize the number of  
conversions in the RTL for some encodings (in particular indeed ansi,  
utf-8 and utf-16), but again this should not require a specially  
compiled RTL. E.g., insert(ansistring($)),  
delete(ansistring($)), etc. can call to special-purpose versions  
for certain specific encodings of the input (e.g., for the three you  
mentioned), and only if the encoding is not directly supported or if  
different encodings are mixed then perform a round trip via some  
generic format (utf-16, utf-32, or something that depends on the  
platform).


This has the advantage that you always have all optimal  
implementations available, regardless of the platform or default  
string encoding. It does not require extra work because we have to  
write all those versions also if we want the RTL to be compilable for  
different default string encodings. And three checks in a case  
statement are not going to define the performance in a context of  
atomic reference counting, dynamic memory management and the  
occasional code page conversion (and since this may reduce the number  
of code page conversions when working with non-native strings, it  
can also be a performance win).
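
The dispatch idea above can be sketched roughly as follows. This is not RTL code: Python stands in for Pascal, the Tagged type, the helper names and the choice of cp1252 as the "ansi" case are all assumptions, and the generic path simply round-trips through Python's own string type where a real RTL would use some common format such as UTF-16.

```python
from dataclasses import dataclass


@dataclass
class Tagged:
    """Hypothetical stand-in for a string that carries its own encoding."""
    data: bytes
    encoding: str  # e.g. "cp1252", "utf-8", "utf-16-le"


def _utf8_offset(data, index):
    """Byte offset of the index-th code point (0-based) in UTF-8 data."""
    pos = 0
    for _ in range(index):
        pos += 1
        while pos < len(data) and data[pos] & 0xC0 == 0x80:
            pos += 1  # skip continuation bytes
    return pos


def _utf16_offset(data, index):
    """Byte offset of the index-th code point (0-based) in UTF-16LE data."""
    pos = 0
    for _ in range(index):
        unit = int.from_bytes(data[pos:pos + 2], "little")
        pos += 4 if 0xD800 <= unit <= 0xDBFF else 2  # surrogate pair or not
    return pos


def insert(dst, src, index):
    """Insert src into dst before code point index (0-based)."""
    if dst.encoding == src.encoding:
        # The "three checks in a case statement": direct paths, no conversion.
        if dst.encoding == "cp1252":
            offset = index
        elif dst.encoding == "utf-8":
            offset = _utf8_offset(dst.data, index)
        elif dst.encoding == "utf-16-le":
            offset = _utf16_offset(dst.data, index)
        else:
            offset = None
        if offset is not None:
            merged = dst.data[:offset] + src.data + dst.data[offset:]
            return Tagged(merged, dst.encoding)
    # Mixed or unsupported encodings: round trip via a generic format.
    # Encoding back may fail if dst's codepage lacks a character of src.
    text = dst.data.decode(dst.encoding)
    piece = src.data.decode(src.encoding)
    merged = text[:index] + piece + text[index:]
    return Tagged(merged.encode(dst.encoding), dst.encoding)


same = insert(Tagged("héllo".encode("utf-8"), "utf-8"),
              Tagged("XY".encode("utf-8"), "utf-8"), 2)
mixed = insert(Tagged(b"abc", "utf-8"),
               Tagged("é".encode("cp1252"), "cp1252"), 1)
```

When both strings share a supported encoding no conversion happens at all; only the mixed case pays for the round trip.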


Outside the RTL, the encoding mainly matters if you perform manual low- 
level processing of a string (for i:=1 to length(s) do  
something_with(s[i])). But in that case your code will either work  
with only one encoding, which you have to enforce via the parameter  
type anyway, or it has to work with multiple encodings, and then you  
can use a technique similar to what I described above for the RTL.


That doesn't mean that a per unit switch is useless, but I think a  
target
switch to fixate the bulk of the cases will save both us and the  
users a lot

of grief.


It's not really clear to me which problem this would solve, but I may  
be missing something.



Jonas


Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Marco van de Voort
In our previous episode, Jonas Maebe said:

  If it's a D2009-style ansistring, does that matter?
 
  A lot of conversion, since it will use ansistring(0) so reading/ 
  writing
  ansistring(cp_utf8) will force conversions. (0 means system  
  encoding, $
  means never convert)
 
 Why should a tstringlist force ansistring(0)?

I mean that if you locally (for your units) set string=utf8string,
TStringList still would be ansistring(0) or whatever the default becomes. 
(and it could be UTF16 even)
Since TStringList inherits from TStrings so would most Lazarus components.

 Or does Delphi force it  to be that way?

In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16),
ansistring (+ variants) are for legacy only, and people try to forget
shortstring as quickly as possible.

Backwards compatibility to pre D2009 is essentially abandoned. I think they
didn't even try for exactly the reasons I mean to address here.

I think in the planned Embarcadero cross-compile products, string will also
be utf-16 on OS X and Linux.  If only because it is (1) easier, and windows
remains dominant by far (including UTF16 assuming codebases) (2) they plan
to target QT. 

Keep in mind that soon it will not be possible to upgrade from ansistring to
a current version anymore (and something like D5..D7 already is no longer
upgradable).  Embarcadero changed the upgrade rules.

From Delphi related forums and mailing lists, I get the impression that most
fulltime Delphi programmers migrated to unicode, and the occasional and
legacy users did not. The gap between these two groups is widening, but
unlike Embarcadero, we will be dealing with significant portions of both
groups for a while (as new/existing users) 


So the question is how we are going to deal with this information, without
forcing a big bang like Embarcadero did, prepare to support both (or more? 
see below) schemes for a while, _AND_ deal with the fact that UTF16 is
mostly alien on non-Windows.

For me, having a mandatory UTF16 Unix is not an option, and a mandatory UTF8
Windows neither.  (D2009+ incompatible)

Since no single choice with one default type per target (or even one to rule
them all) will satisfy everybody, I was thinking about setting up multiple
targets.

Of course it is uncharted territory, and while I lean towards that solution,
it could be that there are hidden caveats.
 
 Conversion may indeed be required for output (input would only pass on  
 the encoding of the input if based on ansistring($))

ansistring(0), system encoding would be more logical than $. $ is
used more internally in string conversion routines and for strings that are
not strings.

But what does that mean on Windows, where the console encoding is OEMSTRING
and not ansistring(0) ?  

 but I think doing that only when necessary at the lowest level should be
 no problem.  Many existing frameworks work that way.

It touches all places where you touch the OS. But indeed one could try to
split this by doing the classes utf8 or tunicodestring depending on OS.

And we have to deal with Windows, where the default is UTF16.

  Besides that the usual three problems:
 
  - I  don't know how VAR behaves in this case. (passing a  
  ansistring(cp_utf8) to a var ansistring(0) parameter),
 
 var-parameters may indeed pose a problem in case some parameters of OS- 
 neutral routines are required to have a particular encoding specified.

  - maybe overloading (only cornercases?) etc.
 
 Possibly, although I guess there are probably rules for that (whether  
 they are document is another case though, probably...)
 
  - inheritance. FPC defines base classes as ansistring(0) parameters,  
  and
Lazarus wants to inherit and override them with a different type.  
  This will clash.
 
 Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
 base classes?

This is the core problem. What solution will do for everybody
(legacy,Lazarus,Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or
TUnicodestring) ?

And what do we do if e.g. Lazarus changes opinion and goes from utf8 to
utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).

And do we really want Lazarus' direction to fixate this for everybody?

Or what if they bring in a new Kylix principle with utf16 base type?

I'm very reluctant to make a choice here, and say "insert conversions if
something changes". I would build in some flexibility and potential
differentiation from the start. 

At least in principle. As said, we can see which combinations are popular
for release time. 

  I've thought long and hard about this. Since the discussion what the
  dominant type should be won't stop anytime soon, and we probably  
  will have
  to support both UTF8 (*nix) and UTF16 (Windows and *nix/QT) as  
  basetypes in
  the long run, plus a time ANSI as legacy, the RTL has to be prepared  
  for it
  anyway, we might as well allow this on all platforms from the start.
  (actually releasing 

Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Hans-Peter Diettrich

Jonas Maebe schrieb:


In current Delphi is String synonym for base type UnicodeString UTF-16
AFAIU ATM in FPC is String synonym for AnsiString (as in previous 
versions of Delphi)

Are there any plans to change meaning of String type ?
(like Delphi to UnicodeString , or UTF8String?)


If/when this is done, it will only be with a compiler switch or directive.


AFAIR Delphi doesn't offer such a compiler option, because units with
different settings do not fit together (2 VCL versions would not be
sufficient).

I'm not sure about details, but the Delphi designers certainly
encountered problems that definitely forbid mixing string types. One
such problem may be a slowdown due to many implicit string conversions,
together with compiler warnings about possible losses on the conversion
back to Ansi, and real losses as I observed in VB years ago. Another one
may be the maintenance of duplicate (overloaded) procedures in the
standard libraries.


When FPC implements two distinct versions, adding another Unicode/Ansi
level to the unit output tree, both versions can be compiled from the
same source code, possibly using conditional compilation where necessary.

For my part, I'd be happy with two definitely different Ansi and UTF(8) 
string types, with automatic conversion. But even then it would be wise 
to add another string parameter type, like Delphi RawByteString, that 
accepts both Ansi and UTF-8 arguments without implicit conversions.


DoDi



Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Jonas Maebe

On 10 Jan 2011, at 16:27, Marco van de Voort wrote:

 In our previous episode, Jonas Maebe said:
 
 Why should a tstringlist force ansistring(0)?
 
 I mean that if you locally (for your units) set string=utf8string,
 TStringList still would be ansistring(0) or whatever the default becomes. 

I meant: why not use ansistring($) instead? You could even add a property 
to tstringlist that causes it to force the encoding of added strings to a 
particular code page whenever a string is added.

 Or does Delphi force it  to be that way?
 
 In D2009+ it is unicodestring, period. Everything is unicodestring (UTF16),
 ansistring (+ variants) are for legacy only, and people try to forget
 shortstring as quickly as possible.

Then a unicodestring version is certainly required, and an ansistring($) 
version would have to be called differently.

 I think in the planned Embarcadero cross-compile products, string will also
 be utf-16 on OS X and Linux.  If only because it is (1) easier, and windows
 remains dominant by far (including UTF16 assuming codebases) (2) they plan
 to target QT. 

I think it's a good decision to keep it the same everywhere, since 
string=unicodestring is not an opaque type in any way. As a result, choosing a 
different string type on other platforms would probably break lots of code 
again. And regardless of which toolkit you target on Mac OS X, conversions will 
probably happen anyway. The encoding used by Carbon and Cocoa is not specified 
anywhere afaik, and the CFString/NSString they are based on can use any 
encoding internally (I guess that's probably also UTF-16 for ease of 
processing).

 For me, having a mandatory UTF16 Unix is not an option, and a mandatory UTF8
 Windows neither.  (D2009+ incompatible)

I don't think UTF-16 everywhere would be a big problem.

 Conversion may indeed be required for output (input would only pass on  
 the encoding of the input if based on ansistring($))
 
 ansistring(0), system encoding would be more logical than $. $ is
 used more internally in string conversion routines and for strings that are
 not strings.

The fact that the formal return type is $ does not mean, afaik, that you 
also have to return something whose internal encoding is set to $. It can 
still be an ansistring(0), ansistring(OEMSTRING) or whatever. It simply means 
that the encoding won't be forced to anything in particular when you assign a 
value to the function result. If you then assign this function result to 
another variable (which may have a forced encoding), then a conversion will 
happen if the forced encoding is different from the actual one. If you assign 
it to another ansistring($), no encoding change will happen in any case, 
and the destination string will inherit the source's encoding.
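
The assignment rule described here can be sketched as follows. It is an interpretation in Python, not Delphi or FPC semantics taken from any documentation: RAW stands in for the "never convert" declared encoding discussed in the thread, and the function name and the tuple representation are assumptions.

```python
RAW = None  # stand-in for the "never convert" declared encoding


def assign(value, declared):
    """Assign value (a (bytes, actual_encoding) pair) to a variable.

    declared is the encoding forced by the destination's type, or RAW.
    """
    data, actual = value
    if declared is RAW or declared == actual:
        # No forced encoding, or already matching: the destination
        # inherits the source's bytes and encoding unchanged.
        return (data, actual)
    # Forced encoding differs from the actual one: convert.
    return (data.decode(actual).encode(declared), declared)


source = ("é".encode("cp1252"), "cp1252")
kept = assign(source, RAW)           # stays cp1252
converted = assign(source, "utf-8")  # re-encoded to UTF-8
```

A function whose formal result type is the raw variant would therefore hand back whatever encoding it produced internally, and a conversion would only show up at the first assignment to a variable with a different forced encoding.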

 But what does that mean on Windows, where the console encoding is OEMSTRING
 and not ansistring(0) ?  

As I said: ansistring($). 

 but I think doing that only when necessary at the lowest level should be
 no problem.  Many existing frameworks work that way.
 
 It touches all places where you touch the OS. But indeed one could try to
 split this by doing the classes utf8 or tunicodestring depending on OS.

I'm not sure why you say "indeed", because I did not propose to do that. I only 
proposed keeping as many RTL interfaces as possible in ansistring($) to 
have something that's
a) generic, and
b) with the least chance of resulting in encoding conversion

However...

 And we have to deal with Windows, where the default is UTF16.

... since Delphi 2009 uses (unicode)string everywhere, we need at least also 
unicode versions.

 Why ansistring(0) for base classes? OS-level interfaces: yes, but why  
 base classes?
 
 This is the core problem. What solution will do for everybody
 (legacy,Lazarus,Delphi/unicode?) or (ansistring(0), ansistring(cp_utf8) or
 TUnicodestring) ?
 
 And what do we do if e.g. Lazarus changes opinion and goes from utf8 to
 utf16 on Windows? (e.g. the Delphi/unicode becomes the dominant influx).
 
 And do we really want Lazarus' direction to fixate this for everybody?
 
 Or what if they bring in a new Kylix principle with utf16 base type?

A unicodestring version for Delphi-compatibility, and if required an 
ansistring($) version for all other purposes (afaik that would also work 
with legacy ansistring=ansistring(0), although it's not yet clear to me what 
happens if you pass an empty ansistring(0) to a rawbytestring var-parameter -- 
is it still nil like with current ansistrings, or can you somehow extract its 
declared encoding?)

 I agree that the RTL should work regardless of the used string  
 encoding, but I don't see why a particular encoding should be enforced  
 throughout the entire RTL rather than just using ansistring($)  
 almost everywhere.
 
 That only solves the 1-byte case.

It's true that you probably need a separate overloaded version for 
unicodestring (just like we currently also have separate 

Re: [fpc-devel] String and UnicodeString and UTF8String

2011-01-10 Thread Martin Schreiber
On Monday, 10. January 2011 16.27:19 Marco van de Voort wrote:

 And there are three such cases

 - normal FPC and Delphi 2007- code :  ansistring(0)
 - Lazarus : ansistring=utf8
 - Delphi 2009+  UTF16.

- fpGUI: ansistring = utf-8
- MSEgui: existing FPC UnicodeString = utf-16

Martin