Michael Lutz wrote:
On 21.10.2011 00:20, Hans-Peter Diettrich wrote:
The Ansi/UTF-16 migration is much easier than a migration to UTF-8. When your legacy code can assume that every (visible) character is a Char, in an SBCS codepage, this is not different in UTF-16.

Ever heard of decomposed characters?

Right, that's one of the strange-looking features of Unicode, related to *language* conventions.

Don't even think about collation, sorting, upper/lower-casing etc, there's
a reason the ICU library comes with 16 MB of data in addition to the code.

Right, see above.

For writing applications that are aware of different languages, Unicode by itself is not very helpful - but Unicode makes it possible to implement and use libraries that deal with everything beyond code points.

Conclusion: Every Unicode encoding has variable length characters. Code
points in UTF-32 are of fixed size, in UTF-16 come in two sizes, and in
UTF-8 come in four sizes (not six, as the Unicode standard chose not to
utilize the full 32-bit numerical space). Additionally, UTF-16 and UTF-32
are not endian neutral.
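
To make the size claim concrete, here is a minimal Free Pascal sketch that reports how many bytes (UTF-8) or 16-bit code units (UTF-16) a single code point occupies. The function names Utf8Bytes and Utf16Units are made up for this illustration, and the code assumes a valid scalar value (it does not reject surrogates or values above $10FFFF).

```pascal
program CodePointSizes;
{$mode objfpc}{$H+}

// Hypothetical helper: bytes needed to encode one code point in UTF-8 (1..4).
// Assumes cp is a valid Unicode scalar value; no range checking is done.
function Utf8Bytes(cp: Cardinal): Integer;
begin
  if cp < $80 then
    Result := 1
  else if cp < $800 then
    Result := 2
  else if cp < $10000 then
    Result := 3
  else
    Result := 4;
end;

// Hypothetical helper: 16-bit code units needed in UTF-16 (1 or 2).
function Utf16Units(cp: Cardinal): Integer;
begin
  if cp < $10000 then
    Result := 1
  else
    Result := 2;  // encoded as a surrogate pair
end;

begin
  WriteLn(Utf8Bytes($41), ' ', Utf16Units($41));        // U+0041 'A':  1 1
  WriteLn(Utf8Bytes($E9), ' ', Utf16Units($E9));        // U+00E9:      2 1
  WriteLn(Utf8Bytes($20AC), ' ', Utf16Units($20AC));    // U+20AC:      3 1
  WriteLn(Utf8Bytes($1F600), ' ', Utf16Units($1F600));  // U+1F600:     4 2
end.
```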

Data can be compressed in various ways; text is only one kind of such data. An application should select the most appropriate text encoding, and use exactly this one internally.


Conclusion 2: For storing a single visible character, a simple
char/wchar_t/wxChar/wxUniChar/whatever variable is not enough. You always
need a string to cater for decomposed characters.

Right, outside SBCS the code has to distinguish between logical and physical characters, indices and counts. IMO it's not normally required to store exactly one logical character, so (sub)strings should be used everywhere.
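
A small Free Pascal sketch of the logical/physical distinction, using a decomposed "e with acute accent" (U+0065 followed by U+0301). The helpers IsCombining and LogicalLength are made up for this illustration, and the combining-mark test is a deliberate simplification: it only knows the block U+0300..U+036F, whereas a real implementation would consult the Unicode character database (e.g. via ICU).

```pascal
program DecomposedChar;
{$mode objfpc}{$H+}

// Hypothetical and simplified: only U+0300..U+036F (Combining Diacritical
// Marks) is treated as combining. Real code needs the full Unicode data.
function IsCombining(c: WideChar): Boolean;
begin
  Result := (Ord(c) >= $0300) and (Ord(c) <= $036F);
end;

// Counts logical characters by not counting combining marks.
// Surrogate pairs are ignored here to keep the sketch short.
function LogicalLength(const s: UnicodeString): Integer;
var
  i: Integer;
begin
  Result := 0;
  for i := 1 to Length(s) do
    if not IsCombining(s[i]) then
      Inc(Result);
end;

var
  s: UnicodeString;
begin
  s := WideChar($0065);      // 'e'
  s := s + WideChar($0301);  // combining acute accent
  WriteLn(Length(s));        // physical count: 2 UTF-16 code units
  WriteLn(LogicalLength(s)); // logical count under the assumption above: 1
end.
```

A single WideChar variable cannot hold this character, which is why a (sub)string is needed even for "one" visible character.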

E.g. it may be more efficient to cut a string, at the place where a certain pattern has been found, into two or three substrings, and then continue working with the preceding or following string. This eliminates the need to find the starting position again, based on a logical character count that may have been returned.

AFAIK Java uses powerful substrings, which are references into the same string, with their own physical offsets and sizes, reducing runtime and storage requirements. Imagine what could be done with an equivalent SubString type in OPL, that can be used to refer to the preceding, matched and following parts of the entire string, without iterating again over the entire string. E.g.
  function Match(str, pattern: string): SubString;
  function Match(str, start, delim: string): SubString;
would allow for simple implementations of:
  MatchNext(str, substr): SubString;
  Replace(str, substr, newstr);
  Leader(str, substr): SubString;
  Trailer(str, substr): SubString;
etc.
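
A rough Free Pascal sketch of how such a type could look. Everything here is an assumption for illustration: the SubString record stores only a physical offset and length (it does not hold a reference to the string, so the string is passed along), the parameter lists differ slightly from the declarations above, and AsString is an extra helper that materializes the text.

```pascal
program SubStringSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical type: a view into a string as physical offset and length.
  SubString = record
    Start: Integer;  // 1-based index of the first element; 0 means "no match"
    Len: Integer;    // number of elements covered
  end;

// Finds the first occurrence of pattern in str.
function Match(const str, pattern: string): SubString;
begin
  Result.Start := Pos(pattern, str);
  if Result.Start > 0 then
    Result.Len := Length(pattern)
  else
    Result.Len := 0;
end;

// Part of the string before sub; assumes sub is a valid match (Start > 0).
function Leader(const sub: SubString): SubString;
begin
  Result.Start := 1;
  Result.Len := sub.Start - 1;
end;

// Part of str after sub; assumes sub is a valid match (Start > 0).
function Trailer(const str: string; const sub: SubString): SubString;
begin
  Result.Start := sub.Start + sub.Len;
  Result.Len := Length(str) - Result.Start + 1;
end;

// Copies the referenced part out of str.
function AsString(const str: string; const sub: SubString): string;
begin
  Result := Copy(str, sub.Start, sub.Len);
end;

var
  s: string;
  m: SubString;
begin
  s := 'key=value';
  m := Match(s, '=');
  WriteLn(AsString(s, Leader(m)));      // key
  WriteLn(AsString(s, Trailer(s, m)));  // value
end.
```

Leader and Trailer are computed from the stored offsets alone, so the string is not scanned a second time.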

DoDi


--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
