Jeff Wormsley wrote:
On 01/11/2011 11:10 AM, Hans-Peter Diettrich wrote:
UTF-8 combines a single (byte-based) storage type with lossless
encoding of full Unicode. Ansi and UCS2 (really UTF-16) only *look*
easier to handle in user code, but both will fail and require special
code whenever characters outside the assumed codepage may occur.
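A minimal sketch of the claim above, written in Python rather than Pascal so it can be run as-is. The sample text and the choice of cp1252 as the stand-in "Ansi" codepage are assumptions made for illustration only. It shows that UTF-8 round-trips the text, that the codepage cannot encode it at all, and that UTF-16 needs a surrogate pair (two 16-bit units) for a character outside the BMP, so "one unit per character" no longer holds.

```python
# Assumed sample text: Latin-1 letters, CJK, and U+1D11E (outside the BMP).
text = "Gr\u00fc\u00dfe, \u4e16\u754c \U0001D11E"

# UTF-8: byte-based storage, lossless round trip.
utf8 = text.encode("utf-8")
roundtrip_ok = utf8.decode("utf-8") == text

# "Ansi": a single-byte codepage (cp1252 assumed here) cannot hold the CJK text.
try:
    text.encode("cp1252")
    ansi_ok = True
except UnicodeEncodeError:
    ansi_ok = False

# UTF-16: split into 16-bit units and look for surrogates.
utf16 = text.encode("utf-16-le")
units = [int.from_bytes(utf16[i:i + 2], "little") for i in range(0, len(utf16), 2)]
has_surrogates = any(0xD800 <= u <= 0xDFFF for u in units)

print("code points:", len(text))
print("UTF-16 units:", len(units))
print("UTF-8 round trip ok:", roundtrip_ok)
print("fits in cp1252:", ansi_ok)
print("UTF-16 contains surrogates:", has_surrogates)
```

Code that treats each 16-bit unit as a character works until the first surrogate pair appears, which is the "special code" case mentioned above.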
Preface: I don't write international apps, and probably won't for the
foreseeable future...
Then you may be bound to some legacy compiler version when string
handling changes at some point in the future, as happened to Delphi
users. Continued support of the AnsiString type(s) is not enough, because
legacy code can be broken by the changes this may require to "set of
char", sizeof(char) and PChar, sizeof(string) as opposed to
Length(string), upper/lower case conversion, and many more not so obvious
consequences.
Isn't all of this concentration on trying to make strings have single
byte characters (who cares how they are encoded), using the argument
that it is somehow faster, incorrect for just about any modern
processor, including embedded CPUs such as ARM? It was my
understanding that 32-bit aligned access was always faster than byte
aligned access on just about any CPU FPC still supports.
See Marco's comment about data size etc.
The argument holds just fine for memory, but I don't really get the
speed argument. Maybe I'm missing something.
FPC (the compiler) still uses ShortStrings wherever possible, because
that was found to be the most efficient string representation. This is
partially due to the ASCII encoding of source code, except for string
literals. But like you, I'm not sure that this argument still holds on
modern hardware.
Speed loss may occur due to:
- data shuffling in general (total byte count)
- (implied) string conversion
- indexed access to MBCS[1] strings (including UTF-8/16)
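To put rough numbers on the "total byte count" item, here is a small sketch in Python (not Pascal, so it runs stand-alone). The helper name encoded_sizes and the sample strings are hypothetical and chosen for illustration; the sketch only counts bytes and says nothing about actual run time on any particular CPU.

```python
def encoded_sizes(s):
    """Return the number of bytes needed to store s in each encoding."""
    return {enc: len(s.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# Assumed samples: ASCII source code, Western European text, CJK text.
samples = {
    "ascii source": "begin writeln('x'); end.",
    "german": "Gr\u00f6\u00dfe",
    "cjk": "\u6f22\u5b57",
}

for name, s in samples.items():
    print(name, encoded_sizes(s))
```

For pure ASCII input, such as most source code, a 4-byte element type moves four times as many bytes as a byte-based one, which is the data size side of the argument.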
[1] All encodings with a variable "character" size discourage indexed access
to strings. "char" can then have multiple meanings: either the
(physical) string/array *element* size, or the
(logical) size of a *codepoint*. Until now most users, including you,
most probably have not realized the difference between physical and logical
characters, and assume that sizeof(char) is always 1, and possibly
that sizeof(WideChar) is 2. IMO variables of type "char" should have at
least 3 (better 4) bytes in a Unicode environment, in order to maintain
the correspondence between physical and logical characters. As already
suggested, the "packed" keyword could be applied to strings and char
arrays, to clearly signal to the user that indexed access should not
be used with such variables, unless a speed penalty is acceptable.
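The difference between physical and logical indexing can be sketched as follows, again in Python so that it runs stand-alone. The function names utf8_codepoint_at and utf32_codepoint_at are hypothetical, and the input is assumed to be valid UTF-8 (no validation of continuation bytes). With a 4-byte element the n-th codepoint is found by plain arithmetic; with UTF-8 the string has to be scanned from the start.

```python
def utf8_codepoint_at(data, index):
    """Return the index-th codepoint (0-based) of valid UTF-8 bytes by linear scan."""
    pos = 0
    seen = 0
    while pos < len(data):
        lead = data[pos]
        # The lead byte tells how many bytes this codepoint occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        if seen == index:
            return data[pos:pos + width].decode("utf-8")
        pos += width
        seen += 1
    raise IndexError("codepoint index out of range")


def utf32_codepoint_at(data, index):
    """Return the index-th codepoint of UTF-32-LE bytes by direct offset."""
    if index < 0 or 4 * index + 4 > len(data):
        raise IndexError("codepoint index out of range")
    return data[4 * index:4 * index + 4].decode("utf-32-le")


s = "a\u00df\u4e16\U0001D11Ez"   # 1-, 2-, 3-, 4- and 1-byte codepoints in UTF-8
as_utf8 = s.encode("utf-8")
as_utf32 = s.encode("utf-32-le")

print(len(s), "codepoints,", len(as_utf8), "UTF-8 bytes")
print(hex(as_utf8[1]))                 # physical element 1: only part of a codepoint
print(utf8_codepoint_at(as_utf8, 1))   # logical codepoint 1, found by scanning
print(utf32_codepoint_at(as_utf32, 1)) # logical codepoint 1, found by offset
```

The scan is where the speed penalty for indexed access comes from: its cost grows with the index, while the fixed-width lookup stays constant.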
DoDi
_______________________________________________
fpc-devel maillist - fpc-devel@lists.freepascal.org
http://lists.freepascal.org/mailman/listinfo/fpc-devel