On 13/08/17 12:18, Juha Manninen via Lazarus wrote:
> Unicode was designed to solve exactly the problems caused by locale differences.
> Why don't you use it?
I believe you effectively answer your own question in your preceding post:

> Actually using the Windows system codepage is not safe any more.
> The current Unicode system in Lazarus maps AnsiString to use UTF-8.
> Text with a Windows codepage must be converted explicitly.
> This is a breaking change compared to the old Unicode support in
> Lazarus 1.4.x + FPC 2.6.x.
If you are processing strings as "text" then you probably do not care how they are encoded and can live with "breaking changes". However, if you are (or need to be) aware of how the text is encoded, or are using string types as a convenient container for binary data, then types that sneak up on you with implicit conversions, or whose semantics change between compilers and versions, are just another source of bugs.
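
As a minimal illustration (my own sketch, not anything from Juha's post): put two arbitrary bytes into an AnsiString and let an implicit round trip through UnicodeString happen. Whether the bytes survive depends on the default code page in force at the time, which is exactly the kind of surprise I mean.

program ImplicitConversionSketch;
{$mode objfpc}{$H+}
uses
  SysUtils;
var
  a: AnsiString;
  u: UnicodeString;
begin
  SetLength(a, 2);
  a[1] := #$C3;                      // arbitrary "binary" bytes, not valid UTF-8
  a[2] := #$28;
  u := a;                            // implicit AnsiString -> UnicodeString conversion
  a := u;                            // ... and back again
  // Depending on DefaultSystemCodePage the original bytes may have been
  // replaced or re-encoded; the round trip is not guaranteed to be lossless.
  WriteLn('Byte 1 is now $', IntToHex(Ord(a[1]), 2));
end.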

PChar used to be a safe means of accessing binary data, but not any more, especially if you move between FPC and Delphi. (One of my gripes is that the FCL still makes too much use of PChar instead of PByte, with the resulting Delphi incompatibility.) The "string" type also used to be a safe container for any sort of binary data, but now that its definition can change between compilers and versions it is something to be avoided.

As a general rule, I now always use PByte for anything string-like that is binary, untyped or whose encoding is yet to be determined. It works across compilers (FPC and Delphi) with consistent semantics and is safe for such use.
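
For example (my own sketch; the file name is hypothetical), raw bytes loaded into a memory stream can be walked through a PByte without any character or code page semantics attaching themselves to the data:

program PByteSketch;
{$mode objfpc}{$H+}
{$POINTERMATH ON}
uses
  Classes, SysUtils;
var
  ms: TMemoryStream;
  p: PByte;
  i: Int64;
begin
  ms := TMemoryStream.Create;
  try
    ms.LoadFromFile('mixed.dat');            // hypothetical mixed binary/text file
    p := PByte(ms.Memory);
    // The data is inspected byte by byte: no code page, no implicit
    // conversions, and the same behaviour in FPC and Delphi.
    for i := 0 to ms.Size - 1 do
      if p[i] = 0 then
        WriteLn('NUL byte at offset ', i);
  finally
    ms.Free;
  end;
end.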

I also really like AnsiString from FPC 3.0 onwards. Making the encoding a dynamic attribute of the type means that I know what is in the container and can keep control.
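
A quick sketch of what I mean by the dynamic attribute (assuming FPC 3.0+; the values are illustrative): StringCodePage reports the code page currently attached to the payload, and SetCodePage either re-labels it or transliterates it, as you choose.

program DynamicCodePage;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}   // iconv-based code page conversions on Unix
  SysUtils;
var
  s: RawByteString;
begin
  s := 'Some text read from somewhere';
  WriteLn('Initial code page : ', StringCodePage(s));
  // Re-label the existing bytes as UTF-8 without touching them
  SetCodePage(s, CP_UTF8, False);
  WriteLn('After re-labelling: ', StringCodePage(s));
  // Convert the payload to Windows-1252; here the bytes may change
  SetCodePage(s, 1252, True);
  WriteLn('After conversion  : ', StringCodePage(s));
end.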

I am sorry, but I would only consider using UnicodeString as a type (or as the default string type) when I am processing text for which the encoding does not matter, such as a window caption, or for intensive text analysis. If I am reading or writing text from a file or database, where the encoding is often implicit and may differ from the Unicode standard, then my preference is AnsiString. I can read the text (e.g. from the file) into a RawByteString buffer, set the encoding and then process it safely, often avoiding the overhead of any transliteration. PByte comes into its own when the file contains a mixture of binary data and text.
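
The pattern looks roughly like this (my own sketch; the file name and code page are hypothetical and would normally come from whatever out-of-band knowledge you have about the file):

program LoadWithCodePage;
{$mode objfpc}{$H+}
uses
  Classes, SysUtils;

function LoadRawFile(const FileName: string; CodePage: TSystemCodePage): RawByteString;
var
  fs: TFileStream;
begin
  fs := TFileStream.Create(FileName, fmOpenRead or fmShareDenyWrite);
  try
    SetLength(Result, fs.Size);
    if Length(Result) > 0 then
      fs.ReadBuffer(Result[1], Length(Result));
    // Attach the encoding we know the file uses; False = re-label only,
    // do not transliterate the bytes.
    SetCodePage(Result, CodePage, False);
  finally
    fs.Free;
  end;
end;

var
  s: RawByteString;
begin
  s := LoadRawFile('legacy.txt', 1252);      // hypothetical Windows-1252 file
  WriteLn('Code page: ', StringCodePage(s), ', length: ', Length(s));
end.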

Text files and databases tend to use UTF-8 or legacy Windows code pages; the Chinese also have GB18030. With a database the encoding is usually known, and AnsiString is a good way to read/write data and to convey the encoding, especially as databases usually use a variable-length multi-byte encoding natively rather than UTF-16/Unicode. With files the text encoding is usually implicit, and AnsiString is ideal here because it lets you read in the text and then assign the (implicit) encoding to the string, or ensure the correct encoding when writing.
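
For a known encoding such as GB18030 you can go a step further and declare a code-page-aware AnsiString type, so that assignments convert for you (a sketch assuming FPC 3.0+; on Unix the cwstring unit supplies the iconv-based conversions):

program GB18030Sketch;
{$mode objfpc}{$H+}
uses
  {$ifdef unix}cwstring,{$endif}   // iconv-based code page conversions on Unix
  SysUtils;
type
  TGB18030String = type AnsiString(54936);   // 54936 = GB18030 code page
var
  gb: TGB18030String;
  u: UTF8String;
begin
  u := 'Plain ASCII converts unchanged';
  gb := u;                          // converted to GB18030 on assignment
  WriteLn('Code page: ', StringCodePage(gb), ', length: ', Length(gb));
end.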

And anyway, I do most of my work on Linux, so why would I even want to bother with arrays of WideChar when the default environment is UTF-8?

We do need some stability and consistency in string types which, as someone else noted, have been confused by Embarcadero. I would like to see that effort focused on AnsiString, with UnicodeString reserved for specialist use on Windows or for cases where intensive text analysis makes a two-byte encoding more efficient than a variable-length multi-byte encoding.

Tony Whyman
MWA
