Jeff Wormsley wrote:
On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:

UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. ANSI and UCS-2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.
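To make the claim above concrete, here is a minimal sketch in Python (not FPC code). The sample string, the function names and the choice of cp1252 as the "assumed codepage" are illustrative assumptions, not anything taken from the compiler or the RTL. It counts logical codepoints against physical storage elements and shows what a single-byte codepage does to characters it cannot represent.

```python
# Hypothetical sample: ASCII 'a', Latin-1 'e acute', Greek capital omega,
# and an emoji (U+1F600) that lies outside the Basic Multilingual Plane.
SAMPLE = "a\u00e9\u03a9\U0001F600"


def storage_units(text: str) -> dict:
    """Count logical codepoints and physical storage elements per encoding."""
    return {
        "codepoints": len(text),
        "utf8_bytes": len(text.encode("utf-8")),
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


def ansi_round_trip(text: str, codepage: str = "cp1252") -> str:
    """Encode to a single-byte codepage and back.

    Characters the codepage cannot represent are replaced by '?',
    so the round trip is lossy for them.
    """
    return text.encode(codepage, errors="replace").decode(codepage)


if __name__ == "__main__":
    print(storage_units(SAMPLE))
    # UTF-8 round-trips the full string without loss.
    print(SAMPLE.encode("utf-8").decode("utf-8") == SAMPLE)
    # ascii() keeps the output printable on any console.
    print(ascii(ansi_round_trip(SAMPLE)))
```

In this sample neither UTF-8 nor UTF-16 has one storage element per codepoint: the emoji needs four UTF-8 bytes and a surrogate pair in UTF-16. The codepage round trip loses the omega and the emoji entirely.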

Preface: I don't write international apps, and probably won't for the foreseeable future...

Then you may be bound to some legacy compiler version when string handling changes at some point in the future, as happened to Delphi users. Continued support of the AnsiString type(s) is not enough, because legacy code can be broken by (possibly) required changes to "set of char", sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.

Isn't all of this concentration on trying to make strings have single-byte characters (who cares how they are encoded), using the argument that it is somehow faster, incorrect for just about any modern processor, including embedded CPUs such as ARM? It was my understanding that 32-bit aligned access was always faster than byte-aligned access on just about any CPU FPC still supports.

See Marco's comment about data size etc.

The argument holds just fine for memory, but I don't really get the speed argument. Maybe I'm missing something.

FPC (the compiler) still uses ShortStrings wherever possible, because that was found to be the most efficient string representation. This is partially due to the ASCII encoding of source code, except for string literals. But like you, I'm not sure that this argument still holds on modern hardware.

Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)

[1] All encodings of variable "character" size discourage indexed access to strings. Then "char" can have multiple meanings, as either representing the (physical) string/array *element* size, or the (logical) size of a *codepoint*. Until now most users, including you, most probably don't realize that difference between physical and logical characters, and assume that sizeof(char) always is 1, and possibly that sizeof(WideChar) is 2. IMO variables of type "char" should have at least 3 (better 4) bytes in a Unicode environment, in order to maintain the correspondence between physical and logical characters. As already suggested, the "packed" keyword could be applied to strings and char arrays, to definitely signal to the user that indexed access should not be used with such variables, unless a speed penalty is acceptable.
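The cost of indexed access described in [1] can be sketched as follows, again in Python rather than Pascal. The helper names codepoint_at and codepoint_at_fixed are hypothetical, and the code assumes well-formed UTF-8 input. Finding the n-th logical character in UTF-8 needs a linear scan over lead bytes, while a fixed 4-byte element (UTF-32 here) allows direct indexing.

```python
def codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th codepoint of well-formed UTF-8 data.

    Requires a linear scan: each lead byte tells how many bytes
    the current codepoint occupies.
    """
    if index < 0:
        raise IndexError("negative index")
    pos = 0
    seen = 0
    while pos < len(data):
        lead = data[pos]
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        elif lead >> 3 == 0b11110:
            size = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + size].decode("utf-8")
        pos += size
        seen += 1
    raise IndexError("codepoint index out of range")


def codepoint_at_fixed(data32: bytes, index: int) -> str:
    """Return the index-th codepoint of UTF-32-LE data by direct offset."""
    if index < 0 or 4 * index + 4 > len(data32):
        raise IndexError("codepoint index out of range")
    return data32[4 * index:4 * index + 4].decode("utf-32-le")


if __name__ == "__main__":
    text = "a\u00e9\u03a9\U0001F600z"  # illustrative sample only
    utf8 = text.encode("utf-8")
    utf32 = text.encode("utf-32-le")
    # Physical element 1 is only the first byte of the 'e acute' sequence.
    print(hex(utf8[1]))
    # Logical character 3 is the emoji, found by scanning.
    print(ascii(codepoint_at(utf8, 3)))
    # With 4-byte elements the same lookup is a plain offset computation.
    print(ascii(codepoint_at_fixed(utf32, 3)))
```

The fixed-width variant trades memory for constant-time indexing, which is the trade-off the 3 or 4 byte "char" suggestion is about.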

DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
