On 16/08/17 11:05, Juha Manninen via Lazarus wrote:
>> 2. Clean up the char type.
>> ...
>> Why shouldn't there be a single char type that intuitively represents
>> a single character, regardless of how many bytes are used to represent it?
> What do you mean by "a single character"?
> A "character" in Unicode can mean about 7 different things. Which one
> is your pick?
> This question is for everybody in this thread who used the word "character".
Are you making my points for me? If such a basic term as "character" can mean 7 different things, then something is badly amiss. It should be fairly obvious that in this context character = printable symbol, whilst for practical reasons also allowing for format control characters such as "end of line" and "end of string".

I believe that you need to go back to the idea of an abstract representation of a character with a constant semantic, separate from the actual encoding, and for which there may be many different and valid encodings. For example, using a somewhat dated comparison, a lower-case Latin alphabet letter 'a' should always have a constant semantic, but in ASCII it is encoded as decimal 97, while in EBCDIC it is encoded as decimal 129. Even though they have different binary values, they represent the same abstract character.
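
To make that point concrete, here is a trivial Free Pascal illustration; the two byte values are simply the ones cited above:

program AbstractCharDemo;
{ The same abstract character 'a' keeps its meaning while its encoded
  byte value differs between encodings. }
const
  AsciiLowerA  = 97;   // 'a' as encoded in ASCII (and UTF-8)
  EbcdicLowerA = 129;  // 'a' as encoded in EBCDIC
begin
  WriteLn('ASCII  encoding of a: ', AsciiLowerA);
  WriteLn('EBCDIC encoding of a: ', EbcdicLowerA);
end.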

I want a 'char' type in Pascal to represent a character such as a lower-case 'a' regardless of the encoding used. Indeed, for a program to be properly portable, the programmer should not have to care about the actual encoding - only that it is a lower-case 'a'.

Hence my proposal that a character type should include an implicit or explicit attribute that records the encoding scheme used - which could vary from ASCII to UTF-32.

You can then go on to define a text string as an array of characters with the same encoding scheme.
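
Purely as a sketch of the idea - every name below (TEncodingScheme, TAbstractChar, TAbstractString) is hypothetical, not an existing FPC or Lazarus type:

program TaggedCharSketch;
{$mode objfpc}
type
  TEncodingScheme = (esASCII, esUTF8, esUTF16, esUTF32);

  // One "character": the bytes of a single code point plus a tag
  // recording how those bytes are to be interpreted.
  TAbstractChar = record
    Encoding: TEncodingScheme;
    Bytes: array of Byte;        // 1..4 bytes depending on Encoding
  end;

  // A text string: characters sharing one encoding scheme.
  TAbstractString = record
    Encoding: TEncodingScheme;
    Chars: array of TAbstractChar;
  end;

function MakeAsciiChar(b: Byte): TAbstractChar;
begin
  Result.Encoding := esASCII;
  SetLength(Result.Bytes, 1);
  Result.Bytes[0] := b;
end;

var
  c: TAbstractChar;
begin
  c := MakeAsciiChar(97);        // the abstract 'a', tagged as ASCII
  WriteLn('Tag: ', Ord(c.Encoding), '  byte: ', c.Bytes[0]);
end.

Comparison and case mapping could then be defined on the abstract value, with the encoding tag consulted only when converting to or from raw bytes.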

>> Yes, in a world where we have to live with UTF8, UTF16, UTF32, legacy code
>> pages and Chinese variations on UTF8, that means that dynamic attributes
>> have to be included in the type. But isn't that the only way to have
>> consistent and intuitive character handling?
> What do you mean? The Chinese don't have a variation of UTF8.
> UTF8 is a global, unambiguous encoding standard, part of Unicode.

I was referring to GB 18030 and the fact that it encodes code points as one, two or four bytes.
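
For reference, a decoder can tell the length of a GB 18030 sequence from the lead bytes. A rough Free Pascal sketch, with the ranges as I understand the GB 18030 specification (illustrative only - no trail-byte validation):

program Gb18030Len;
{$mode objfpc}
function Gb18030SeqLen(b1, b2: Byte): Integer;
begin
  if b1 <= $7F then
    Result := 1            // one byte: the ASCII range
  else if (b1 >= $81) and (b1 <= $FE) then
  begin
    if (b2 >= $30) and (b2 <= $39) then
      Result := 4          // four-byte sequence
    else
      Result := 2          // two-byte (GBK-compatible) sequence
  end
  else
    Result := -1;          // $80 and $FF are not valid lead bytes
end;

begin
  WriteLn(Gb18030SeqLen($61, $00));  // 1, e.g. 'a'
  WriteLn(Gb18030SeqLen($D2, $BB));  // 2, e.g. U+4E00
  WriteLn(Gb18030SeqLen($81, $30));  // 4, start of the four-byte range
end.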

> The fundamental problem is that you want to hide the complexity of
> Unicode by some magic String type of a compiler.
> It is not possible. Unicode remains complex, but the complexity is NOT
> in encodings!
> No, a codepoint's encoding is the easy part. For example, I was easily
> able to create a unit to support encoding-agnostic code. See unit
> LazUnicode in package LazUtils.
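
[For anyone who has not tried it: my own sketch of what encoding-agnostic code with LazUnicode looks like, assuming the unit's CodePointLength function and its for..in code point enumerator - check the unit itself for the exact API.]

program CodePointDemo;
{$mode objfpc}{$H+}
uses LazUnicode;               // from package LazUtils
var
  s, cp: String;
begin
  s := 'añ€';                  // 1-, 2- and 3-byte UTF-8 code points
  WriteLn(CodePointLength(s)); // 3 code points, although Length(s) = 6
  for cp in s do               // the enumerator yields one code point
    WriteLn(cp);               // per iteration, regardless of byte count
end.
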
> The complexity is elsewhere:
> - "Character" composed of codepoints in precomposed and decomposed
>   (normalized) forms.
> - Compare and sort text based on locale.
> - Uppercase / Lowercase rules based on locale.
> - Glyphs
> - Graphemes
> - etc.
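
[To illustrate the first point in that list: the byte values below are the standard UTF-8 encodings of U+00E9 and of U+0065 followed by U+0301.]

program NormalizationDemo;
{$mode objfpc}{$H+}
var
  Precomposed, Decomposed: String;
begin
  Precomposed := #$C3#$A9;         // U+00E9 LATIN SMALL LETTER E WITH ACUTE
  Decomposed  := 'e'#$CC#$81;      // U+0065 + U+0301 COMBINING ACUTE ACCENT
  WriteLn(Precomposed = Decomposed);      // FALSE: the bytes differ...
  WriteLn(Precomposed, '  ', Decomposed); // ...yet both display as é
end.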

> I must admit I don't understand those complex parts well.
> I do understand codeunits and codepoints, and I understand they are
> the easy part.

> Juha
The point I believe you are missing is that a character is an abstract symbol with a semantic independent of how it is encoded. Collation sequences are independent of encoding and should remain the same regardless of how a character set is encoded.
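
As a concrete case of collation being separate from encoding, compare byte-value ordering with locale-aware ordering; CompareStr and AnsiCompareStr are standard SysUtils routines, though the second result naturally depends on the system locale:

program CollationDemo;
{$mode objfpc}{$H+}
uses SysUtils;
begin
  WriteLn(CompareStr('a', 'B'));      // > 0: byte value 97 > 66
  WriteLn(AnsiCompareStr('a', 'B'));  // typically < 0 under locale rules
end.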