Re: [Lazarus] Does Lazarus support a complete Unicode Component Library?

Hans-Peter Diettrich Thu, 17 Feb 2011 07:45:48 -0800

Graeme Geldenhuys wrote:

On 2011-02-17 11:28, Michael Schnell wrote:

On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:

I often search for substrings, delete them from the string, insert
other strings at certain places, etc.
How can you do all this without knowledge of the internal structure of
the string?

This (magically :-) ) does work with UTF8.


NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
UTF-8 text,


You can, when you do it in the *right* way.

because those RTL functions work purely on ANSI text
(1-byte characters - speaking of String type text here) and don't know
about multi-byte characters, combining diacritics etc.

Pos() certainly works with MBCS as well, and you cannot expect that
combining characters and ligatures are handled by the basic Unicode
functions. When Copy() requires a byte count, you can compute it from
the difference of the index positions of the involved substrings. It
would be better, though, if the basic procedures did not deal with
counts or sizes at all.
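A minimal sketch of the index-difference approach, in Free Pascal. It
assumes the string holds valid UTF-8, written here as explicit byte
escapes so the result does not depend on the source file's codepage.
UTF-8 is self-synchronizing, so a byte-wise Pos() with a valid UTF-8
needle can only match at a codepoint boundary. Between is a hypothetical
helper for illustration, not an RTL or LCL function.

```pascal
program PosCopyUtf8;
{$mode objfpc}{$H+}

// Hypothetical helper: returns the text from the first occurrence of
// StartMark up to (not including) the first occurrence of EndMark.
// Both searches are byte-wise; all indices are physical (byte) indices.
function Between(const S, StartMark, EndMark: string): string;
var
  StartIdx, EndIdx: SizeInt;
begin
  Result := '';
  StartIdx := Pos(StartMark, S);
  EndIdx := Pos(EndMark, S);
  if (StartIdx > 0) and (EndIdx > StartIdx) then
    // The byte count for Copy() is the difference of the two indices.
    Result := Copy(S, StartIdx, EndIdx - StartIdx);
end;

var
  S, Part: string;
begin
  // 'Grüße, schöne Welt!' as UTF-8 bytes (ü = C3 BC, ß = C3 9F, ö = C3 B6).
  S := 'Gr'#$C3#$BC#$C3#$9F'e, sch'#$C3#$B6'ne Welt!';
  Part := Between(S, 'sch', ' Welt');
  WriteLn(Length(S));     // 22 bytes for 19 visible characters
  WriteLn(Length(Part));  // 7 bytes: 'schöne'
end.
```

No character counting is needed at any point: the two Pos() results
already are the physical positions that Copy() expects.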

Hence LCL and
fpGUI have special functions, similar to the RTL ones, that know how to
work with UTF-8 encoded text, e.g. the UTF8Pos(), UTF8Length() and
UTF8Copy() functions.

This is a stupid idea, IMO. A "UTF8" prefix is inappropriate when it
comes to the distinction between physical and logical functionality.
E.g. the number of *logical* (maybe visible) characters can be
determined from any string encoding, and that function should have a
*unique* name and (possibly) overloaded implementations. Likewise a
SubString procedure could take two index positions, which can be
determined without knowledge of the string encoding. This way string
insertion or extraction does not require a re-parse of the strings in
order to translate logical into physical indices and counts.
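A sketch of what such an index-based SubString could look like in Free
Pascal. SubString and ReplaceRange are hypothetical names, not existing
RTL or LCL routines; the string is assumed to hold valid UTF-8, given as
byte escapes. The indices come straight from Pos() and Length(), so
nothing is re-parsed.

```pascal
program SubStringByIndex;
{$mode objfpc}{$H+}

// Hypothetical SubString: takes two physical indices and returns the
// half-open range [StartIdx, EndIdx). The caller never passes a count.
function SubString(const S: string; StartIdx, EndIdx: SizeInt): string;
begin
  Result := Copy(S, StartIdx, EndIdx - StartIdx);
end;

// Hypothetical helper: replaces the range [StartIdx, EndIdx) with NewText.
function ReplaceRange(const S: string; StartIdx, EndIdx: SizeInt;
  const NewText: string): string;
begin
  Result := SubString(S, 1, StartIdx) + NewText
          + SubString(S, EndIdx, Length(S) + 1);
end;

var
  S, Old: string;
  StartIdx, EndIdx: SizeInt;
begin
  // 'Müller & Söhne' as UTF-8 bytes (ü = C3 BC, ö = C3 B6).
  S := 'M'#$C3#$BC'ller & S'#$C3#$B6'hne';
  Old := 'S'#$C3#$B6'hne';

  StartIdx := Pos(Old, S);            // physical index of the match
  EndIdx := StartIdx + Length(Old);   // physical index just past it

  WriteLn(SubString(S, StartIdx, EndIdx) = Old);
  WriteLn(ReplaceRange(S, StartIdx, EndIdx, 'Co'));
end.
```

The same two routines would work unchanged for any encoding in which a
valid needle can only match at a character boundary, since they never
interpret the bytes between the indices.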

IMO we simply have to agree that Length() is a physical property, the
number of elements in an array. A logical character count has a very
different meaning in string handling, and not even a *single* meaning
once we start dealing with ligatures and other Unicode stuff[1].
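To make the distinction concrete, here is a small Free Pascal sketch. It
uses one name, CodepointCount (a hypothetical function, not part of the
RTL), with overloads for UTF-8 and UTF-16 strings, and compares its
result with Length(). The inputs are assumed to be valid UTF-8 and
UTF-16 respectively and are built from explicit code units.

```pascal
program PhysicalVsLogical;
{$mode objfpc}{$H+}

// Codepoints in a UTF-8 string: count every byte that is not a
// continuation byte (10xxxxxx). Assumes S is valid UTF-8.
function CodepointCount(const S: AnsiString): SizeInt; overload;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $C0) <> $80 then
      Inc(Result);
end;

// Codepoints in a UTF-16 string: count every code unit that is not a
// low (trailing) surrogate. Assumes S is valid UTF-16.
function CodepointCount(const S: UnicodeString): SizeInt; overload;
var
  I: SizeInt;
begin
  Result := 0;
  for I := 1 to Length(S) do
    if (Ord(S[I]) and $FC00) <> $DC00 then
      Inc(Result);
end;

var
  Precomposed, Decomposed: AnsiString;
  Clef: UnicodeString;
begin
  Precomposed := 'caf'#$C3#$A9;   // 'café' using U+00E9
  Decomposed := 'cafe'#$CC#$81;   // 'café' as 'e' + combining U+0301
  SetLength(Clef, 2);             // U+1D11E as a surrogate pair
  Clef[1] := WideChar($D834);
  Clef[2] := WideChar($DD1E);

  WriteLn(Length(Precomposed), ' ', CodepointCount(Precomposed)); // 5 4
  WriteLn(Length(Decomposed), ' ', CodepointCount(Decomposed));   // 6 5
  WriteLn(Length(Clef), ' ', CodepointCount(Clef));               // 2 1
end.
```

Both 'café' strings display as four characters, yet they differ in
Length() and in codepoint count. Counting visible characters would need
yet another definition, which is why Length() is best left as the plain
element count.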

[1] In a mix of left-to-right (LTR) and right-to-left (RTL) parts, a
distinction between sequential physical and logical indices is required
as well. The first RTL codepoint physically follows the preceding LTR
codepoint, but logically (on screen...) it precedes the *next* LTR
codepoint. I see only one proper solution to such quirks: restricting
the arguments of string handling functions to physical (array) indices.
Logical increments of such indices are at the discretion of the user,
depending on his understanding of the desired result. Library functions
can only deal with different encodings, and will always return physical
indices.


DoDi


