Re: [Lazarus] UTF8 RTL for Windows

Hans-Peter Diettrich Tue, 25 Nov 2014 12:44:40 -0800

Mattias Gaertner wrote:

On Tue, 25 Nov 2014 11:53:00 +0100
Hans-Peter Diettrich wrote:

[...]

Correction: *This* Char type needs to be extended.

Please specify.


The ThousandSeparator type is "Char", which does not work with
Russian in UTF-8. Well, at least if you want the non-breaking space
instead of the normal space.
There are many cases where Char is enough.


You admit that there exist cases where Char is not enough :-]

There is a Pos overload for strings. Where is the flaw in Pos?

The flaw is the added overload with a Char parameter.


I use that a lot. It is faster than the string variant.
Why is that a flaw?

When working with SBCS you can assume that a Char can hold any entire character. This is not true with MBCS, like UTF-8.

With CP_ACP set to UTF-8 you cannot assign 'ä' to a Char and search for it. Depending on your exact code, the compiler may not detect that this assignment is invalid, because it assigns only *part* of a multibyte sequence. A subsequent Pos with that partial character cannot always yield the *expected* result; it might find an 'ö' or 'ü' as well. In detail, that Pos overload has no indication of the codepage of the Char, and consequently cannot enforce a possibly required conversion to the encoding of the string parameter. The same considerations apply to any StringReplace (or similar) overloads.
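
A minimal byte-level sketch of this effect, written in Python rather than Pascal so that it can be run anywhere. Python bytes stand in for a UTF-8 encoded AnsiString, and slicing off the first byte is only an assumed model of what ends up in a one-byte Char; the actual FPC behaviour depends on the compiler version and settings.

```python
# Python bytes stand in for a UTF-8 encoded AnsiString.
text = "Grün".encode("utf-8")       # contains 'ü' but no 'ä'
a_umlaut = "ä".encode("utf-8")       # two bytes: C3 A4

# Assumed model of a one-byte Char holding 'ä': only the lead byte survives.
truncated_char = a_umlaut[:1]        # C3

# 1-based positions, mirroring Pascal's Pos (0 means "not found").
pos_char = text.find(truncated_char) + 1    # matches the lead byte of 'ü'
pos_string = text.find(a_umlaut) + 1        # full sequence is not present

print(pos_char, pos_string)
```

The search with the truncated byte reports a hit inside 'ü', because 'ä', 'ö' and 'ü' all share the lead byte C3 in UTF-8, while the search with the complete two-byte sequence correctly reports no match.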

Delphi users may think, like you, that a Char is sufficient in such cases. They are right insofar as in Unicode Delphi a Char is a WideChar and a String is a UnicodeString, so that such optimizations work with BMP characters. [Users of MBCS/non-BMP character sets already know that Char is quite useless for text processing]
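
The BMP restriction can be checked by counting 16-bit code units. This is a small Python sketch, not Delphi code; the helper name utf16_units is made up for the illustration.

```python
def utf16_units(ch: str) -> int:
    """Number of 16-bit code units one character occupies in UTF-16."""
    return len(ch.encode("utf-16-le")) // 2

bmp_units = utf16_units("\u00e4")          # U+00E4 'ä', inside the BMP
non_bmp_units = utf16_units("\U0001D11E")  # U+1D11E, outside the BMP

print(bmp_units, non_bmp_units)
```

A character inside the BMP fits into a single 16-bit unit, which is why a WideChar is enough there; a character outside the BMP needs a surrogate pair and therefore no longer fits.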

But when such code is compiled with FPC/Lazarus and the new RTL, where String is AnsiString and the default encoding is UTF-8, the same code will not work properly. That's why I consider Char (=AnsiChar) dangerous in the new RTL, causing obscure program errors.

Removing Char, perhaps in some special compiler mode, would make it possible to identify all *possibly* wrong uses of the *generic* Char. Then the code can be fixed in various ways, e.g. by replacing Char with WideChar or UCS4Char (4 bytes), removing overloads with Char parameters, or whatever else prevents inadvertent misuse of constants, variables, fields or parameters of Char type.

Please note that Delphi compatibility is not a valid argument as long as FPC/Lazarus differs in the declaration of the generic String and Char types. That's why Delphi made the Unicode move in *one* step, retyping both String and Char at the same time, and (effectively) deprecating AnsiString. This at least makes legacy code applicable to BMP encodings, where WideChar is sufficient to hold any character value, and legacy MBCS code continues to work without surprises.

Furthermore the Pos arguments should never be subject to automatic conversion, otherwise the returned index will be useless.
You can argue the same way in the direction: If it does not
automatically convert it will find crap.

That's why the *original* declaration, with both parameters of type String, *allows* the compiler to identify and perform all required conversions. A Char type, without an encoding indicator, prevents such checks and conversions both at compiler level (in translating the call) and inside the function code.
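
The index concern in this exchange can be sketched at byte level. The Python below only models the situation: the Latin-1 re-encoding stands in for a hypothetical automatic conversion of the string argument, it is not what any particular RTL routine does.

```python
word = "näher"
original = word.encode("utf-8")      # bytes as held by the caller
converted = word.encode("latin-1")   # assumed result of an automatic conversion

needle = b"h"
# 1-based positions, mirroring Pascal's Pos.
pos_in_original = original.find(needle) + 1
pos_in_converted = converted.find(needle) + 1

print(pos_in_original, pos_in_converted)
```

The two positions differ because 'ä' takes two bytes in UTF-8 and one in Latin-1, so an index computed on the converted copy does not point at the same character in the caller's string.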

In the best case Char could be retyped into a string (substring),

That would be wrong in 99.9% of the cases.

Please give at least one example.


Retype "Char" to "String" and the compiler will bark. For example in
Graphics.

How is *graphic* information related to *text*? Using Char for Byte, only because using strings offers some coding comfort, is another flaw.

Delphi has long discouraged the use of strings for holding anything but text. The continued abuse of strings for other types of information will now result in errors whenever an (implicit) string conversion occurs in some library routine, as can easily happen with encoded AnsiStrings. Unfortunately Delphi missed the chance to simply add an "unencoded" AnsiString encoding, which would prevent any conversion of such string variables. The RawByteString type, despite its name, was added for quite a different purpose, *not* as a way to safely store arbitrary bytes in such strings.
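
What such an implicit conversion does to non-text data can be sketched as follows. This is a Python model, not Pascal: the four payload bytes and the Windows-1252 to UTF-8 conversion are assumptions chosen for illustration.

```python
# Arbitrary binary data kept in a "string"; here the first four bytes of a PNG file.
payload = bytes([0x89, 0x50, 0x4E, 0x47])

# Assumed implicit conversion: bytes are taken as Windows-1252 text
# and re-encoded as UTF-8.
as_text = payload.decode("cp1252")
converted = as_text.encode("utf-8")

print(payload.hex(), converted.hex())
print(len(payload), len(converted))
```

The byte 0x89 is reinterpreted as a character and comes out as a three-byte UTF-8 sequence, so both the content and the length of the data change without any error being reported.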


DoDi


