Re: [Lazarus] substr return wrong string with some utf8 char

Hans-Peter Diettrich Mon, 14 Feb 2011 13:21:51 -0800

Michael Schnell wrote:

AFAIK, the decision to use UTF8 is due to Linux using this encoding and so no conversion is done in the LCL system API.

IMO more important: no new string and char types (Wide...) are required, and no duplicate set of string handling procedures. This may be essential for databases and communication as well.

This of course is bad with Windows, as here the API uses UTF16 and everything needs to be recoded in the LCL system API on entry and exit.

The overhead may be negligible in direct API calls, when these do real work. Strings in (visual) components can be converted once, into the internally used (OS display conforming) representation, and again the conversion overhead can be low to undetectable in the GUI.

Supposedly, using different string types - UTF8String vs. (a reference-counted version of the UTF-16-encoded) WideString - for Linux and Windows at the LCL user code interface is too confusing.

A *portable* UTF string implementation should be restricted, eliminating direct and indexed access to chars (which become substrings). A dedicated UTF16 class/type can be added at any time, as an optional package.
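
To illustrate why indexed access is the problem (and why the substr in the subject line can return a wrong string), here is a minimal sketch in Python rather than Pascal. It is not LCL or RTL code; the helper names byte_substr, codepoint_substr and is_valid_utf8 are made up for this example. It compares a substring taken on raw UTF-8 bytes with one taken on whole code points.

```python
def byte_substr(data: bytes, start: int, count: int) -> bytes:
    """Substring on raw bytes, as a byte-indexed string type would do it."""
    return data[start:start + count]


def codepoint_substr(text: str, start: int, count: int) -> str:
    """Substring on whole code points (Python's str is indexed by code point)."""
    return text[start:start + count]


def is_valid_utf8(data: bytes) -> bool:
    """Report whether the bytes form a complete, well-formed UTF-8 sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


text = "Grüße"                      # 5 code points, 7 bytes in UTF-8
encoded = text.encode("utf-8")

broken = byte_substr(encoded, 0, 3)     # cuts the 2-byte 'ü' in half
intact = codepoint_substr(text, 0, 3)   # keeps 'ü' whole

print(len(text), len(encoded))
print(broken, is_valid_utf8(broken))
print(intact, is_valid_utf8(intact.encode("utf-8")))
```

Note that even the code point version is only an approximation of "characters": combining sequences still span several code points, which is one more reason to hand out substrings instead of single chars.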

OTOH I agree that the weak (non-existing) distinction between Ansi and UTF8 strings is not pleasing. But here I'd establish a strong boundary between general (Unicode=UTF8) strings, and application specific strings of a single (immutable) codepage - remember that "Ansi" is not a single specific encoding, instead it's a collection of single-byte-char encodings, including UTF-8. Then the user can choose a specific codepage (or UTF-16) for use inside his application, with e.g. an AppString type. Then it's clear where conversions are required and have to be inserted automatically by the compiler.
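
A rough sketch of what such a boundary could look like, again in Python for illustration only. The AppString class, the APP_CODEPAGE constant and the choice of cp1252 are assumptions made for this example, not an existing FPC or Lazarus type. The point is that general text stays Unicode, the application string carries exactly one fixed codepage, and every crossing between the two is a visible conversion.

```python
from dataclasses import dataclass

# Assumption for this sketch: the application picks one codepage, once.
APP_CODEPAGE = "cp1252"


@dataclass(frozen=True)
class AppString:
    """Hypothetical application string: bytes in one fixed single-byte codepage."""
    data: bytes

    @classmethod
    def from_unicode(cls, text: str) -> "AppString":
        # Conversion point: general Unicode text enters the application.
        return cls(text.encode(APP_CODEPAGE))

    def to_unicode(self) -> str:
        # Conversion point: application text leaves towards the general side.
        return self.data.decode(APP_CODEPAGE)

    def char_at(self, index: int) -> str:
        # Safe only because the codepage is single-byte: one byte, one char.
        if not 0 <= index < len(self.data):
            raise IndexError("AppString index out of range")
        return self.data[index:index + 1].decode(APP_CODEPAGE)


app_text = AppString.from_unicode("Grüße")
print(len(app_text.data))      # one byte per character in a single-byte codepage
print(app_text.char_at(2))
print(app_text.to_unicode())
```

In a compiler-supported version the two conversion points would be inserted automatically at assignments between the two types; here they are explicit method calls so that they are easy to see.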

The Delphi model, with differently encoded strings in the same string type, can result in much uncontrollable conversion overhead, easily outweighing the few possible optimizations with current AnsiStrings (assuming SBCS[1] only). The new ABI also is incompatible with existing DLLs of earlier Delphi/BCB versions, causing trouble with third-party components that are not available in the new ABI. Okay, no such problems exist with open source components, but not all Lazarus add-ons or apps are necessarily open source.

[1] With MBCS charsets the same rules apply as to UTF-8, so that UTF-8 can immediately replace all MBCS encodings. So the decision about new string types *only* affects current SBCS Ansi users, even ASCII users are not affected.
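
The footnote can be checked with a few byte counts. This is a small Python sketch; the byte_lengths helper is made up for the example, and cp1252 and Shift-JIS are only picked as typical SBCS and MBCS representatives.

```python
def byte_lengths(text: str, encodings: list[str]) -> dict[str, int]:
    """Return the encoded size in bytes of text for each named encoding."""
    return {name: len(text.encode(name)) for name in encodings}


# ASCII text: identical bytes in ASCII, an SBCS codepage and UTF-8.
print(byte_lengths("substr", ["ascii", "cp1252", "utf-8"]))

# SBCS user: one byte per char in cp1252, but not in UTF-8.
print(byte_lengths("Grüße", ["cp1252", "utf-8"]))

# MBCS user: already more than one byte per char, just like UTF-8.
print(byte_lengths("日本", ["shift_jis", "utf-8"]))
```

Only in the second case does the "one byte is one char" assumption hold before the switch and break after it, which is why SBCS users are the only ones who notice.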


DoDi

