On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
>> 2. Clean up the char type.
>> ...
>> Why shouldn't there be a single char type that intuitively represents
>> a single character, regardless of how many bytes are used to represent it?
> What do you mean by "a single character"?
> A "character" in Unicode can mean about 7 different things. Which one
> is your pick?
> This question is for everybody in this thread who used the word "character".
Are you making my points for me? If such a basic term as "character" can mean 7 different things, then something is badly amiss. It should be fairly obvious that in this context character = printable symbol, whilst for practical reasons also allowing for format control characters such as "end of line" and "end of string".

I believe that you need to go back to the idea of an abstract representation of a character with a constant semantic, separate from the actual encoding, and for which there may be many different and valid encodings. For example, using a somewhat dated comparison, a lower-case Latin alphabet letter 'a' should always have a constant semantic, but in ASCII it is encoded as decimal 97, while in EBCDIC it is encoded as decimal 129. Even though they have different binary values, they represent the same abstract character.
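
To make that point concrete, here is a trivial Free Pascal illustration; the two byte values are simply the ones cited above:

program AbstractCharDemo;
{ The same abstract character 'a' keeps its meaning while its encoded
  byte value differs between encodings. }
const
  AsciiLowerA  = 97;   // 'a' as encoded in ASCII (and UTF-8)
  EbcdicLowerA = 129;  // 'a' as encoded in EBCDIC
begin
  WriteLn('ASCII  encoding of a: ', AsciiLowerA);
  WriteLn('EBCDIC encoding of a: ', EbcdicLowerA);
end.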

I want a 'char' type in Pascal to represent a character such as a lower-case 'a' regardless of the encoding used. Indeed, for a program to be properly portable, the programmer should not have to care about the actual encoding - only that it is a lower-case 'a'.

Hence my proposal that a character type should include an implicit or explicit attribute that records the encoding scheme used - which could vary from ASCII to UTF-32.

You can then go on to define a text string as an array of characters with the same encoding scheme.
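
Purely as a sketch of the idea - every name below (TEncodingScheme, TAbstractChar, TAbstractString) is hypothetical, not an existing FPC or Lazarus type:

program TaggedCharSketch;
{$mode objfpc}
type
  TEncodingScheme = (esASCII, esUTF8, esUTF16, esUTF32);

  // One "character": the bytes of a single code point plus a tag
  // recording how those bytes are to be interpreted.
  TAbstractChar = record
    Encoding: TEncodingScheme;
    Bytes: array of Byte;        // 1..4 bytes depending on Encoding
  end;

  // A text string: characters sharing one encoding scheme.
  TAbstractString = record
    Encoding: TEncodingScheme;
    Chars: array of TAbstractChar;
  end;

function MakeAsciiChar(b: Byte): TAbstractChar;
begin
  Result.Encoding := esASCII;
  SetLength(Result.Bytes, 1);
  Result.Bytes[0] := b;
end;

var
  c: TAbstractChar;
begin
  c := MakeAsciiChar(97);        // the abstract 'a', tagged as ASCII
  WriteLn('Tag: ', Ord(c.Encoding), '  byte: ', c.Bytes[0]);
end.

Comparison and case mapping could then be defined on the abstract value, with the encoding tag consulted only when converting to or from raw bytes.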

>> Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
>> pages and Chinese variations on UTF8, that means that dynamic attributes
>> have to be included in the type. But isn't that the only way to have
>> consistent and intuitive character handling?
> What do you mean? The Chinese don't have a variation of UTF8.
> UTF8 is a global, unambiguous encoding standard, part of Unicode.

I was referring to GB 18030 and the fact that it encodes code points as one, two or four bytes.
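
For reference, a decoder can tell the length of a GB 18030 sequence from the lead bytes. A rough Free Pascal sketch, with the ranges as I understand the GB 18030 specification (illustrative only - no trail-byte validation):

program Gb18030Len;
{$mode objfpc}
function Gb18030SeqLen(b1, b2: Byte): Integer;
begin
  if b1 <= $7F then
    Result := 1            // one byte: the ASCII range
  else if (b1 >= $81) and (b1 <= $FE) then
  begin
    if (b2 >= $30) and (b2 <= $39) then
      Result := 4          // four-byte sequence
    else
      Result := 2          // two-byte (GBK-compatible) sequence
  end
  else
    Result := -1;          // $80 and $FF are not valid lead bytes
end;

begin
  WriteLn(Gb18030SeqLen($61, $00));  // 1, e.g. 'a'
  WriteLn(Gb18030SeqLen($D2, $BB));  // 2, e.g. U+4E00
  WriteLn(Gb18030SeqLen($81, $30));  // 4, start of the four-byte range
end.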

> The fundamental problem is that you want to hide the complexity of
> Unicode by some magic String type of a compiler.
> It is not possible. Unicode remains complex, but the complexity is NOT
> in encodings!
> No, a codepoint's encoding is the easy part. For example, I was easily
> able to create a unit to support encoding-agnostic code. See unit
> LazUnicode in package LazUtils.
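
[For anyone who has not tried it: my own sketch of what encoding-agnostic code with LazUnicode looks like, assuming the unit's CodePointLength function and its for..in code point enumerator - check the unit itself for the exact API.]

program CodePointDemo;
{$mode objfpc}{$H+}
uses LazUnicode;               // from package LazUtils
var
  s, cp: String;
begin
  s := 'añ€';                  // 1-, 2- and 3-byte UTF-8 code points
  WriteLn(CodePointLength(s)); // 3 code points, although Length(s) = 6
  for cp in s do               // the enumerator yields one code point
    WriteLn(cp);               // per iteration, regardless of byte count
end.
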
> The complexity is elsewhere:
> - "Character" composed of codepoints in precomposed and decomposed
>   (normalized) forms.
> - Compare and sort text based on locale.
> - Uppercase / Lowercase rules based on locale.
> - Glyphs
> - Graphemes
> - etc.
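
[To illustrate the first point in that list: the byte values below are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.]

program NormalizationDemo;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: String;
begin
  Precomposed := #$C3#$A9;         // U+00E9 LATIN SMALL LETTER E WITH ACUTE
  Decomposed  := 'e'#$CC#$81;      // U+0065 + U+0301 COMBINING ACUTE ACCENT
  WriteLn(Precomposed = Decomposed);      // FALSE: the bytes differ...
  WriteLn(Precomposed, '  ', Decomposed); // ...yet both display as é
end.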

> I must admit I don't understand those complex parts well.
> I do understand codeunits and codepoints, and I understand they are
> the easy part.

> Juha
The point I believe you are missing is that a character is an abstract symbol with a semantic independent of how it is encoded. Collation sequences are independent of encoding and should remain the same regardless of how a character set is encoded.
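
As a concrete case of collation being separate from encoding, compare byte-value ordering with locale-aware ordering; CompareStr and AnsiCompareStr are standard SysUtils routines, though the second result naturally depends on the system locale:

program CollationDemo;
{$mode objfpc}{$H+}
uses SysUtils;
begin
  WriteLn(CompareStr('a', 'B'));      // > 0: byte value 97 > 66
  WriteLn(AnsiCompareStr('a', 'B'));  // typically < 0 under locale rules
end.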