Mattias Gaertner wrote:
On Tue, 25 Nov 2014 11:53:00 +0100
Hans-Peter Diettrich <[email protected]> wrote:

[...]
Correction: *This* Char type needs to be extended.
Please specify.

The ThousandSeparator type is "Char", which does not work with
Russian in UTF-8. Well, at least if you want the non breakable space
instead of the normal space.
There are many cases where Char is enough.

You admit that there exist cases where Char is not enough :-]


There is a Pos overload for strings. Where is the flaw in Pos?
The flaw is the added overload with a Char parameter.

I use that a lot. It is faster than the string variant.
Why is that a flaw?

When working with an SBCS you can assume that a Char can hold any complete character. This is not true of an MBCS such as UTF-8.

With CP_ACP set to UTF-8 you cannot assign 'ä' to a Char and then search for it. Depending on your exact code, the compiler may not detect that this assignment is invalid, because it assigns only *part* of a multibyte sequence. A subsequent Pos with that partial character cannot always yield the *expected* result; it might find an 'ö' or 'ü' as well (in UTF-8 all three start with the same lead byte, $C3). In detail, that Pos overload has no indication of the codepage of the Char, and consequently cannot enforce a possibly required conversion to the encoding of the string parameter. The same considerations apply to possible StringReplace (or similar) overloads.
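To make the partial-character problem concrete, here is a minimal byte-level sketch. It is written in Python purely as an analogy, not as FPC code, and it assumes that the one-byte Char ends up holding only the lead byte of 'ä'. The sample text and variable names are made up for illustration.

```python
# Byte-level analogy only; this is not what the FPC RTL executes.
text = "Größe ändern".encode("utf-8")   # the UTF-8 encoded string being searched

a_umlaut = "ä".encode("utf-8")           # b'\xc3\xa4', two bytes
partial = a_umlaut[:1]                   # b'\xc3', all that fits into a one-byte Char

# Searching for the one-byte fragment stops at the lead byte of 'ö'.
pos_partial = text.find(partial)
# Searching for the complete two-byte sequence finds the real 'ä'.
pos_full = text.find(a_umlaut)

print(pos_partial, pos_full)
print(text[pos_partial:pos_partial + 2].decode("utf-8"))
```

The one-byte search reports a hit inside a different character, while the search with the whole sequence lands on the intended one.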

Delphi users may think, like you, that a Char is sufficient in such cases. They are right insofar as in Unicode Delphi a Char is a WideChar and a String is a UnicodeString, so such optimizations work with BMP characters. [Users of MBCS/non-BMP character sets already know that Char is quite useless for text processing]

But when such code is compiled with FPC/Lazarus and the new RTL, where String is AnsiString and the default encoding is UTF-8, the same code will not work properly. That's why I consider Char (=AnsiChar) dangerous in the new RTL: it causes obscure program errors.

Removing Char, perhaps in some special compiler mode, would make it possible to identify all *possibly* wrong uses of the *generic* Char. Then the code can be fixed in various ways, e.g. by replacing Char with WideChar or UCS4Char (4 bytes), removing overloads with Char parameters, or whatever else prevents inadvertent misuse of constants, variables, fields or parameters of Char type.

Please note that Delphi compatibility is not a valid argument as long as FPC/Lazarus differs in the declaration of the generic String and Char types. That's why Delphi made the Unicode move in *one* step, retyping both String and Char at the same time and (effectively) deprecating AnsiString. This at least keeps legacy code usable with BMP encodings, where WideChar is sufficient to hold any character value, and legacy MBCS code continues to work without surprises.


Furthermore the Pos arguments should never be subject to automatic conversion, otherwise the returned index will be useless.

You can argue the same way in the other direction: If it does not
automatically convert it will find crap.

That's why the *original* declaration, with both parameters of type String, *allows* all required conversions to be identified and performed. A Char type, without an encoding indicator, prevents such checks and conversions both at compiler level (in translating the call) and inside the function code.
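As a rough sketch of that idea, again in Python and only as an analogy: when both arguments carry an encoding, the search routine can convert the search term to the encoding of the searched string before comparing bytes. The helper `pos` and its parameters are hypothetical and are not the signature of the RTL's Pos.

```python
def pos(needle: bytes, needle_enc: str, haystack: bytes, haystack_enc: str) -> int:
    """Hypothetical helper: 1-based byte index as in Pascal's Pos, 0 if not found."""
    if needle_enc != haystack_enc:
        # Both encodings are known, so the required conversion can be performed.
        needle = needle.decode(needle_enc).encode(haystack_enc)
    return haystack.find(needle) + 1

haystack = "Größe ändern".encode("utf-8")
needle = "ä".encode("latin-1")           # b'\xe4', a single byte in Latin-1

converted = pos(needle, "latin-1", haystack, "utf-8")
raw = haystack.find(needle) + 1          # no conversion: the byte $E4 never occurs

print(converted, raw)
```

Without the encoding information the raw byte comparison finds nothing, although the character is present in the text.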


In the best case Char could be retyped into a string (substring),
That would be wrong in 99.9% of the cases.
Please give at least one example.

Retype "Char" to "String" and the compiler will bark. For example in
Graphics.

How is *graphic* information related to *text*? Using Char for Byte, only because using strings offers some coding comfort, is another flaw.

Delphi has long discouraged the use of strings for holding anything but text. The continued abuse of strings for other kinds of information will now result in errors whenever an (implicit) string conversion occurs in some library routine, as can easily happen with encoded AnsiStrings. Unfortunately Delphi missed the chance to simply add an "unencoded" AnsiString encoding, which would make it possible to prevent any conversion of such string variables. The RawByteString type, despite its name, was added for quite a different purpose, *not* as a way to safely store arbitrary bytes in such strings.
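The effect of such a conversion on non-text data can be illustrated with a small sketch. This is a Python analogy, not Delphi or FPC code, and it assumes the conversion goes from a single-byte code page (Latin-1 here) to UTF-8. The payload bytes are arbitrary.

```python
# Arbitrary binary data that was never meant to be text.
payload = bytes([0x89, 0x50, 0x4E, 0x47, 0xFF, 0x00])

# A conversion that treats the bytes as Latin-1 text and re-encodes them as UTF-8.
converted = payload.decode("latin-1").encode("utf-8")

print(len(payload), len(converted))
print(converted)
```

Every byte above $7F is expanded into two bytes, so the data no longer matches what was stored.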

DoDi


--
_______________________________________________
Lazarus mailing list
[email protected]
http://lists.lazarus.freepascal.org/mailman/listinfo/lazarus
