Re: [Lazarus] Unicode (was Re: cwstring in arm-linux)

Hans-Peter Diettrich Fri, 21 Oct 2011 14:35:26 -0700

Michael Lutz wrote:

On 21.10.2011 00:20, Hans-Peter Diettrich wrote:
The Ansi/UTF-16 migration is much easier than a migration to UTF-8. When your legacy code can assume that every (visible) character is a Char, in an SBCS codepage, this is not different in UTF-16.
Ever heard of decomposed characters?

Right, that's one of the strange-looking features of Unicode, related to *language* conventions.

Don't even think about collation, sorting, upper/lower-casing etc, there's
a reason the ICU library comes with 16 MB of data in addition to the code.


Right, see above.

For writing applications that are aware of different languages, Unicode by itself is not very helpful - but Unicode makes it possible to implement and use libraries that deal with everything beyond code points.

Conclusion: Every Unicode encoding has variable length characters. Code
points in UTF-32 are of fixed size, in UTF-16 come in two sizes, and in
UTF-8 come in four sizes (not six, as the Unicode standard chose not to
utilize a full 32-bit numerical space). Additionally, UTF-16 and UTF-32
are not endian neutral.
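
To make the size claims concrete, here is a minimal Free Pascal sketch that counts the code units one code point needs in each encoding. The function names Utf8Len and Utf16Len and the sample values are illustrative assumptions, not RTL routines, and the input is assumed to be a valid code point (no surrogate values).

```pascal
program CodeUnitSizes;
{$mode objfpc}{$H+}

// Hypothetical helper: bytes needed to encode one code point in UTF-8 (1..4).
// Assumes cp is a valid code point in the range 0..$10FFFF.
function Utf8Len(cp: LongInt): Integer;
begin
  if cp < $80 then
    Result := 1
  else if cp < $800 then
    Result := 2
  else if cp < $10000 then
    Result := 3
  else
    Result := 4;
end;

// Hypothetical helper: 16-bit code units needed in UTF-16 (1, or 2 for a
// surrogate pair when the code point lies outside the BMP).
function Utf16Len(cp: LongInt): Integer;
begin
  if cp < $10000 then
    Result := 1
  else
    Result := 2;
end;

const
  // 'A', e-acute, euro sign, and one code point beyond the BMP.
  Samples: array[0..3] of LongInt = ($41, $E9, $20AC, $1F600);
var
  i: Integer;
begin
  for i := Low(Samples) to High(Samples) do
    WriteLn('U+', HexStr(Samples[i], 6),
            ': UTF-8 bytes=', Utf8Len(Samples[i]),
            ', UTF-16 units=', Utf16Len(Samples[i]),
            ', UTF-32 units=1');
end.
```

The four samples need 1, 2, 3 and 4 bytes in UTF-8; only the last one needs two code units in UTF-16.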

Data can be compressed in various ways, text is only one kind of such data. An application should select the most appropriate text encoding, and use exactly this one internally.

Conclusion 2: For storing a single visible character, a simple
char/wchar_t/wxChar/wxUniChar/whatever variable is not enough. You always
need a string to cater for decomposed characters.
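
A small Free Pascal sketch of that point, assuming a compiler where UnicodeString holds UTF-16 code units. The variable names are only for illustration. The same visible character occupies one WideChar in precomposed form and two in decomposed form, so a single WideChar variable cannot hold the latter.

```pascal
program DecomposedDemo;
{$mode objfpc}{$H+}

var
  Precomposed, Decomposed: UnicodeString;
begin
  // U+00E9 LATIN SMALL LETTER E WITH ACUTE: a single code point.
  Precomposed := WideChar($00E9);

  // U+0065 'e' followed by U+0301 COMBINING ACUTE ACCENT: two code points
  // that render as the same visible character.
  Decomposed := WideChar($0065);
  Decomposed := Decomposed + WideChar($0301);

  WriteLn('precomposed length: ', Length(Precomposed));  // 1
  WriteLn('decomposed length:  ', Length(Decomposed));   // 2
end.
```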

Right, outside SBCS the code has to distinguish between logical and physical characters, indices and counts. IMO it's not normally required to store exactly one logical character, so that (sub)strings should be used everywhere.

E.g. it may be more efficient to cut a string, at the place where a certain pattern has been found, into two or three substrings, and then continue working with the preceding or following string. This eliminates the need to find the starting position again, based on a possibly returned logical character count.

AFAIK Java uses powerful substrings, which are references into the same string, with their own physical offsets and sizes, reducing runtime and storage requirements. Imagine what could be done with an equivalent SubString type in OPL, that can be used to refer to the preceding, matched and following parts of the entire string, without iterating again over the entire string. E.g.

  function Match(str, pattern: string): SubString;
  function Match(str, start, delim: string): SubString;
would allow for simple implementations of:
  MatchNext(str, substr): SubString;
  Replace(str, substr, newstr);
  Leader(str, substr): SubString;
  Trailer(str, substr): SubString;
etc.
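
One way such a type might look, as a rough Free Pascal sketch only: TSubString, Materialize and the parameter types are assumptions made for illustration, not an existing RTL or LCL API. The view stores physical offsets only, so it is meaningful only together with the string it was taken from.

```pascal
program SubStringSketch;
{$mode objfpc}{$H+}

type
  // Hypothetical view into a string: physical 1-based start index and length.
  TSubString = record
    Start: SizeInt;  // 0 means "no match"
    Len: SizeInt;
  end;

// Locate Pattern in Str and return a view of the matched part.
function Match(const Str, Pattern: string): TSubString;
begin
  Result.Start := Pos(Pattern, Str);
  if Result.Start > 0 then
    Result.Len := Length(Pattern)
  else
    Result.Len := 0;
end;

// View of everything before the match (callers check Sub.Start > 0 first).
// Str is unused here and kept only to mirror the signature in the post.
function Leader(const Str: string; const Sub: TSubString): TSubString;
begin
  Result.Start := 1;
  Result.Len := Sub.Start - 1;
end;

// View of everything after the match (callers check Sub.Start > 0 first).
function Trailer(const Str: string; const Sub: TSubString): TSubString;
begin
  Result.Start := Sub.Start + Sub.Len;
  Result.Len := Length(Str) - Result.Start + 1;
end;

// Copy the viewed characters out only when a real string is needed.
function Materialize(const Str: string; const Sub: TSubString): string;
begin
  Result := Copy(Str, Sub.Start, Sub.Len);
end;

var
  S: string;
  M: TSubString;
begin
  S := 'key=value';
  M := Match(S, '=');
  if M.Start > 0 then
  begin
    WriteLn(Materialize(S, Leader(S, M)));   // key
    WriteLn(Materialize(S, Trailer(S, M)));  // value
  end;
end.
```

Leader and Trailer are computed from the stored offsets alone, so the string is scanned once by Match and not again.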

DoDi


