On 13/08/17 12:18, Juha Manninen via Lazarus wrote:
> Unicode was designed to solve exactly the problems caused by locale differences.
> Why don't you use it?
I believe you effectively answer your own question in your preceding post:

> Actually using the Windows system codepage is not safe any more.
> The current Unicode system in Lazarus maps AnsiString to use UTF-8.
> Text with a Windows codepage must be converted explicitly.
> This is a breaking change compared to the old Unicode support in
> Lazarus 1.4.x + FPC 2.6.x.
If you are processing strings as "text" then you probably do not care how they are encoded and can live with "breaking changes". However, if you are (or need to be) aware of how the text is encoded, or are using string types as a convenient container for binary data, then types that sneak up on you with implicit conversions, or whose semantics change between compilers and versions, are just another source of bugs.
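
As a minimal illustration (my own sketch, not anything from Juha's post): put two arbitrary bytes into an AnsiString and let an implicit round trip through UnicodeString happen. Whether the bytes survive depends on the default code page in force at the time, which is exactly the kind of surprise I mean.

program ImplicitConversionSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  a: AnsiString;
  u: UnicodeString;
begin
  SetLength(a, 2);
  a[1] := #$C3;                      // arbitrary "binary" bytes, not valid UTF-8
  a[2] := #$28;
  u := a;                            // implicit AnsiString -> UnicodeString conversion
  a := u;                            // ... and back again
  // Depending on DefaultSystemCodePage the original bytes may have been
  // replaced or re-encoded; the round trip is not guaranteed to be lossless.
  WriteLn('Byte 1 is now $', IntToHex(Ord(a[1]), 2));
end.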

PChar used to be a safe means of accessing binary data, but not any more, especially if you move between FPC and Delphi. (One of my gripes is that the FCL still makes too much use of PChar instead of PByte, with the resulting Delphi incompatibility.) The "string" type also used to be a safe container for any sort of binary data, but now that its definition can change between compilers and versions it is something to be avoided.

As a general rule, I now always use PByte for anything string-like that is binary, untyped or whose encoding is yet to be determined. It works across compilers (FPC and Delphi) with consistent semantics and is safe for such use.
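
For example (my own sketch; the file name is hypothetical), raw bytes loaded into a memory stream can be walked through a PByte without any character or code page semantics attaching themselves to the data:

program PByteSketch;
{$mode objfpc}{$H+}
{$POINTERMATH ON}
uses
  Classes, SysUtils;
var
  ms: TMemoryStream;
  p: PByte;
  i: Int64;
begin
  ms := TMemoryStream.Create;
  try
    ms.LoadFromFile('mixed.dat');            // hypothetical mixed binary/text file
    p := PByte(ms.Memory);
    // The data is inspected byte by byte: no code page, no implicit
    // conversions, and the same behaviour in FPC and Delphi.
    for i := 0 to ms.Size - 1 do
      if p[i] = 0 then
        WriteLn('NUL byte at offset ', i);
  finally
    ms.Free;
  end;
end.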

I also really like AnsiString from FPC 3.0 onwards. Making the encoding a dynamic attribute of the type means that I know what is in the container and can keep control.
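
A quick sketch of what I mean by the dynamic attribute (assuming FPC 3.0+; the values are illustrative): StringCodePage reports the code page currently attached to the payload, and SetCodePage either re-labels it or transliterates it, as you choose.

program DynamicCodePage;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}   // iconv-based code page conversions on Unix
  SysUtils;
var
  s: RawByteString;
begin
  s := 'Some text read from somewhere';
  WriteLn('Initial code page : ', StringCodePage(s));
  // Re-label the existing bytes as UTF-8 without touching them
  SetCodePage(s, CP_UTF8, False);
  WriteLn('After re-labelling: ', StringCodePage(s));
  // Convert the payload to Windows-1252; here the bytes may change
  SetCodePage(s, 1252, True);
  WriteLn('After conversion  : ', StringCodePage(s));
end.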

I am sorry, but I would only consider using UnicodeString as a type (or as the default string type) when I am processing text for which the encoding does not matter, such as a window caption, or for intensive text analysis. If I am reading or writing text from a file or database, where the encoding is often implicit and may differ from the Unicode standard, then my preference is AnsiString. I can read the text (e.g. from the file) into a RawByteString buffer, set the encoding and then process it safely, often avoiding the overhead of any transliteration. PByte comes into its own when the file contains a mixture of binary data and text.
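
The pattern looks roughly like this (my own sketch; the file name and code page are hypothetical and would normally come from whatever out-of-band knowledge you have about the file):

program LoadWithCodePage;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

function LoadRawFile(const FileName: string; CodePage: TSystemCodePage): RawByteString;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if Length(Result) > 0 then
      fs.ReadBuffer(Result[1], Length(Result));
    // Attach the encoding we know the file uses; False = re-label only,
    // do not transliterate the bytes.
    SetCodePage(Result, CodePage, False);
  finally
    fs.Free;
  end;
end;

var
  s: RawByteString;
begin
  s := LoadRawFile('legacy.txt', 1252);      // hypothetical Windows-1252 file
  WriteLn('Code page: ', StringCodePage(s), ', length: ', Length(s));
end.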

Text files and databases tend to use UTF-8 or legacy Windows code pages; the Chinese also have GB18030. With a database the encoding is usually known, and AnsiString is a good way to read/write data and to convey the encoding, especially as databases usually use a variable-length multi-byte encoding natively rather than UTF-16/Unicode. With files the text encoding is usually implicit, and AnsiString is ideal here because it lets you read in the text and then assign the (implicit) encoding to the string, or ensure the correct encoding when writing.
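
For a known encoding such as GB18030 you can go a step further and declare a code-page-aware AnsiString type, so that assignments convert for you (a sketch assuming FPC 3.0+; on Unix the cwstring unit supplies the iconv-based conversions):

program GB18030Sketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}   // iconv-based code page conversions on Unix
  SysUtils;
type
  TGB18030String = type AnsiString(54936);   // 54936 = GB18030 code page
var
  gb: TGB18030String;
  u: UTF8String;
begin
  u := 'Plain ASCII converts unchanged';
  gb := u;                          // converted to GB18030 on assignment
  WriteLn('Code page: ', StringCodePage(gb), ', length: ', Length(gb));
end.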

And anyway, I do most of my work on Linux, so why would I even want to bother with arrays of WideChar when the default environment is UTF-8?

We do need some stability and consistency in string types which, as someone else noted, have been confused by Embarcadero. I would like to see that effort focused on AnsiString, with UnicodeString reserved for specialist use on Windows or for cases where intensive text analysis makes a two-byte encoding more efficient than a variable-length multi-byte encoding.

Tony Whyman
MWA
