Graeme Geldenhuys wrote:
On 2011-02-17 11:28, Michael Schnell wrote:
On 02/17/2011 07:19 AM, Jürgen Hestermann wrote:
I often search for substrings, delete them from the string, insert
other strings at certain places, etc.
How can you do all this without knowledge of the internal structure of
the string?
This (magically :-) ) does work with UTF8.

NO, it doesn't! You can't use FPC's Copy(), Pos() etc reliably with
UTF-8 text,

You can, when you do it in the *right* way.

because those RTL functions work purely on ANSI text
(1-byte characters - speaking of String type text here) and don't know
about multi-byte characters, combining diacritics etc.

Pos() certainly works with MBCS as well, and you cannot expect the basic Unicode functions to handle combining characters and ligatures. When Copy requires a byte count, you can compute it from the difference of the index positions of the involved substrings. It would be better, though, if the basic procedures did not deal with counts or sizes at all.
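To illustrate the point about index differences, here is a minimal sketch in Python rather than Pascal. Python `bytes` objects stand in for FPC's byte-indexed strings, and `find` and slicing stand in for Pos() and Copy(). The helper name `between` and the sample text are made up for this illustration. It relies on the fact that a search for a valid UTF-8 substring inside valid UTF-8 text can only match at character boundaries.

```python
# Byte-oriented search and slice on UTF-8 data, analogous to Pos() and Copy().
# All indices are physical (byte) indices; no character counting is involved.

def between(haystack: bytes, left: bytes, right: bytes) -> bytes:
    """Return the bytes found between two marker substrings."""
    begin = haystack.find(left)
    if begin < 0:
        return b""
    begin += len(left)                 # first byte after the left marker
    end = haystack.find(right, begin)  # first byte of the right marker
    if end < 0:
        return b""
    # The byte count Copy() would need is simply end - begin.
    return haystack[begin:end]

text = "Gr\u00f6\u00dfe: 10 \u00b5m, L\u00e4nge: 5 \u00b5m".encode("utf-8")
value = between(text, "Gr\u00f6\u00dfe: ".encode("utf-8"),
                ", L\u00e4nge".encode("utf-8"))
print(value.decode("utf-8"))
```

The extracted slice is valid UTF-8 again because both boundaries come from matches of complete substrings.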

Hence LCL and
fpGUI have special functions similar to RTL that know how to work with
UTF-8 encoded text, e.g. the UTF8Pos(), UTF8Length() and UTF8Copy() functions.

This is a stupid idea, IMO. A "UTF8" prefix is inappropriate when it comes to the distinction between physical and logical functionality. E.g. the number of *logical* (maybe visible) characters can be determined from any string encoding, and that function should have a *unique* name and (possibly) overloaded implementations. Likewise a SubString procedure could take two index positions, which can be determined without knowledge of the string encoding. This way string insertion or extraction does not require a re-parse of the strings in order to translate logical into physical indices and counts.
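A rough sketch of that naming idea, again in Python as a stand-in for Pascal: one function name for the logical count, with the encoding selecting the implementation, and a substring routine that takes two physical indices. The names `char_count` and `sub_string` are hypothetical, "logical character" is taken here to mean a code point (only one of several possible meanings), and the input is assumed to be well-formed.

```python
def char_count(data: bytes, encoding: str) -> int:
    """Count code points; one name, one implementation per encoding."""
    if encoding == "utf-8":
        # Every byte that is not a continuation byte (10xxxxxx) starts a code point.
        return sum(1 for b in data if (b & 0xC0) != 0x80)
    if encoding == "utf-16-le":
        # Every 16-bit unit that is not a low surrogate starts a code point.
        units = (int.from_bytes(data[i:i + 2], "little")
                 for i in range(0, len(data) - 1, 2))
        return sum(1 for u in units if not 0xDC00 <= u <= 0xDFFF)
    raise ValueError("unsupported encoding: " + encoding)

def sub_string(data: bytes, start: int, stop: int) -> bytes:
    """Extract data[start:stop]; both arguments are physical indices."""
    return data[start:stop]

sample = "a\u00e4\u20ac\U0001F600"   # 1-, 2-, 3- and 4-byte characters in UTF-8
utf8 = sample.encode("utf-8")
utf16 = sample.encode("utf-16-le")
print(char_count(utf8, "utf-8"), char_count(utf16, "utf-16-le"))
```

`sub_string` never has to translate a logical position into a physical one, so it does not re-parse the string.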

IMO we simply have to agree that Length() is a physical property, the number of elements in an array. A logical character count has a very different meaning in string handling, and not even a *single* meaning, when we start dealing with ligatures and other Unicode stuff[1].
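A small Python illustration of why a logical count has no single meaning. The function name `counts` and the two sample strings are chosen for this sketch only. It reports the physical length in UTF-8 bytes next to three different "logical" counts, using the standard `unicodedata` module.

```python
import unicodedata

def counts(s: str) -> dict:
    """Return the physical length and three candidate logical counts."""
    return {
        "utf8_bytes": len(s.encode("utf-8")),   # physical: array elements
        "code_points": len(s),
        # Code points that are not combining marks.
        "non_combining": sum(1 for ch in s if unicodedata.combining(ch) == 0),
        # Code points after compatibility composition (splits ligatures).
        "after_nfkc": len(unicodedata.normalize("NFKC", s)),
    }

decomposed_e_acute = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
fi_ligature = "\ufb01"           # LATIN SMALL LIGATURE FI

print(counts(decomposed_e_acute))
print(counts(fi_ligature))
```

For the decomposed "é" the three logical counts are 2, 1 and 1; for the ligature they are 1, 1 and 2. Only the byte length is unambiguous.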

[1] In a mix of LTR and RTL parts a distinction between sequential physical and logical indices is required as well. The first RTL codepoint physically follows the preceding LTR codepoint, but logically (on screen...) it precedes the *next* LTR codepoint. I only see one proper solution to such quirks: restricting the arguments of string handling functions to physical (array) indices. Logical increments of such indices are at the discretion of the user, depending on his understanding of the desired result. Library functions can only deal with different encodings, and will always return physical indices.
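As a sketch of what "logical increments at the discretion of the user" could look like, the following Python helper steps a physical index forward by one UTF-8 code point. The name `next_index` is hypothetical, and stepping by code point is only one possible choice of increment. The sample mixes Latin (LTR) and Hebrew (RTL) letters; the indices follow storage order, whatever the display order may be.

```python
def next_index(data: bytes, i: int) -> int:
    """Return the physical index of the next UTF-8 lead byte after i."""
    i += 1
    # Skip continuation bytes (10xxxxxx) of the current code point.
    while i < len(data) and (data[i] & 0xC0) == 0x80:
        i += 1
    return i

# "ab", then Hebrew alef and bet (2 bytes each in UTF-8), then "cd".
data = "ab\u05d0\u05d1cd".encode("utf-8")

starts = []
i = 0
while i < len(data):
    starts.append(i)          # physical index where a code point begins
    i = next_index(data, i)

print(starts)
```

The function takes and returns physical indices only; what counts as "one step" stays with the caller.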

DoDi


--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
