Re: [fpc-devel] String and UnicodeString and UTF8String

Hans-Peter Diettrich Wed, 12 Jan 2011 09:11:29 -0800

Jeff Wormsley wrote:

> On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
>> UTF-8 combines a single (byte-based) storage type with lossless encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look* easier to handle in user code, but both will fail and require special code whenever characters outside the assumed codepage may occur.
> Preface: I don't write international apps, and probably won't for the foreseeable future...

Then you may be bound to some legacy compiler version when string handling changes at some point in the future, as happened to Delphi users. Continued support of the AnsiString type(s) is not enough, because legacy code can be broken by (possibly) required changes to "set of char", sizeof(char) and PChar, sizeof(string) as opposed to Length(string), upper/lower conversion, and many more not so obvious consequences.

> Isn't all of this concentration on trying to make strings have single byte characters (who cares how they are encoded), using the argument that it is somehow faster, incorrect for just about any modern processor, including embedded CPUs such as ARM? It was my understanding that 32 bit aligned access was always faster than byte aligned access on just about any CPU FPC still supports.


See Marco's comment about data size etc.

> The argument holds just fine for memory, but I don't really get the speed argument. Maybe I'm missing something.

FPC (the compiler) still uses ShortStrings wherever possible, because that was found to be the most efficient string representation. This is partially due to the ASCII encoding of source code, except for string literals. But like you, I'm not sure that this argument still holds on modern hardware.


Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)
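To put rough numbers on the first point, here is a minimal sketch (in Python rather than Pascal, purely for illustration; the sample strings and the helper name byte_counts are my own assumptions, not anything from FPC) that compares the total byte count of the same text in three encodings:

```python
# Hypothetical sample texts; escapes are used so the source stays pure ASCII.
samples = {
    "ascii": "begin end",
    "german": "Gr\u00fc\u00dfe",
    "cjk": "\u65e5\u672c\u8a9e",
}


def byte_counts(text):
    """Return the storage size in bytes of text in three Unicode encodings."""
    return {
        "utf-8": len(text.encode("utf-8")),
        # The -le variants are used so that no byte order mark is counted.
        "utf-16": len(text.encode("utf-16-le")),
        "utf-32": len(text.encode("utf-32-le")),
    }


for name, text in samples.items():
    # len(text) counts code points, not bytes.
    print(name, len(text), byte_counts(text))
```

For mostly-ASCII data such as source code, UTF-8 moves a quarter of the bytes that a 4-byte-per-character representation would; for CJK text the picture changes, and UTF-16 is the most compact of the three.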

[1] All encodings with a variable "character" size discourage indexed access to strings. "char" can then have multiple meanings: either the (physical) size of a string/array *element*, or the (logical) size of a *codepoint*. Until now most users, including you, most probably don't realize the difference between physical and logical characters, and assume that sizeof(char) is always 1, and possibly that sizeof(WideChar) is 2. IMO variables of type "char" should have at least 3 (better 4) bytes in a Unicode environment, in order to maintain the correspondence between physical and logical characters. As already suggested, the "packed" keyword could be applied to strings and char arrays, to signal clearly to the user that indexed access should not be used with such variables unless a speed penalty is acceptable.
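The difference between physical elements and logical codepoints can be sketched as follows (again in Python, only as an illustration; codepoint_at is a hypothetical helper, and the sample string is my own choice). Finding the n-th codepoint in UTF-8 data needs a linear scan, while a plain element index may land in the middle of a character:

```python
def codepoint_at(data: bytes, index: int) -> str:
    """Return the index-th code point of valid UTF-8 data by linear scan."""
    if index < 0:
        raise IndexError(index)
    pos = 0   # physical position (byte offset)
    seen = 0  # logical position (code points passed so far)
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells how many bytes this code point occupies.
        if lead < 0x80:
            size = 1
        elif lead >> 5 == 0b110:
            size = 2
        elif lead >> 4 == 0b1110:
            size = 3
        elif lead >> 3 == 0b11110:
            size = 4
        else:
            raise ValueError("not positioned on a lead byte")
        if seen == index:
            return data[pos:pos + size].decode("utf-8")
        pos += size
        seen += 1
    raise IndexError(index)


# 'a', a-umlaut, euro sign, musical G clef: 1, 2, 3 and 4 bytes in UTF-8.
text = "a\u00e4\u20ac\U0001d11e"
utf8 = text.encode("utf-8")
utf16_units = len(text.encode("utf-16-le")) // 2

print(len(text), len(utf8), utf16_units)  # code points, bytes, 16-bit units
print(hex(utf8[2]))                       # element 2 is a continuation byte
print(ascii(codepoint_at(utf8, 2)))       # logical character 2
```

With 4-byte elements the element index and the codepoint index would coincide, which is the correspondence argued for above; with UTF-8 or UTF-16 storage every logical index costs a scan like this one.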


DoDi

_______________________________________________
fpc-devel maillist  -  fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel
