Re: [Pharo-dev] Unicode Support

Henrik Johansen Mon, 07 Dec 2015 05:08:20 -0800

> On 07 Dec 2015, at 1:05 , EuanM <[email protected]> wrote:
> 
> Hi Henry,
> 
> To be honest, at some point I'm going to long for the much
> more succinct semantics of healthcare systems and sports scoring and
> administration systems again.  :-)
> 
> codepoints are any of *either*
>  - the representation of a component of an abstract character, *or*
> eg. "A" #(0041) as a component of
>  - the sole representation of the whole of an abstract character *or* of
> -  a representation of an abstract character provided for backwards
> compatibility which is more properly represented by a series of
> codepoints representing a composed character
> 
> e.g.
> 
> The "A" #(0041) as a codepoint can be:
> the sole representation of the whole of an abstract character "A" #(0041)
> 
> The representation of a component of the composed (i.e. preferred)
> version of the abstract character Å #(0041 030a)
> 
> Å (#00C5) represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
> 
> Å (#212b) also represents one valid compatibility form of the abstract
> character Å which is most properly represented by #(0041 030a).
> 
> With any luck, this satisfies both our semantic understandings of the
> concept of "codepoint"
> 
> Would you agree with that?
> 
> In Unicode, codepoints are *NOT* an abstract numerical representation
> of a text character.
> 
> At least not as we generally understand the term "text character" from
> our experience of non-Unicode character mappings.


I agree, they are numerical representations of what Unicode refers to as 
characters.
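
For the Å example quoted above, here is a minimal sketch in Python (chosen only because its standard library makes this easy to check; it is not Pharo code). It takes the three spellings from the quoted message and shows that they are distinct code point sequences that normalize to the same result. Unicode's own name for this relationship is canonical equivalence. The variable names are mine, purely for illustration.

```python
import unicodedata

# The three spellings of Å from the quoted message, as code point sequences.
base_plus_ring = "\u0041\u030a"  # A followed by COMBINING RING ABOVE
precomposed = "\u00c5"           # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom_sign = "\u212b"         # ANGSTROM SIGN

spellings = [base_plus_ring, precomposed, angstrom_sign]

# Compared as raw code point sequences, all three are different strings.
distinct_sequences = len(set(spellings))

# Canonical normalization maps all three to one form:
# NFD gives the decomposed sequence, NFC gives the precomposed code point.
nfd = [unicodedata.normalize("NFD", s) for s in spellings]
nfc = [unicodedata.normalize("NFC", s) for s in spellings]

print(distinct_sequences)
print([[f"{ord(c):04X}" for c in s] for s in nfd])
print([[f"{ord(c):04X}" for c in s] for s in nfc])
```

Nothing here involves bytes or an encoding; it is all at the level of code points.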

> 
> codepoints represent "*encoded characters*"

No. A codepoint is the numerical value assigned to a character. An "encoded 
character" is the way a codepoint is represented in bytes using a given 
encoding.
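
To make that concrete, a small Python sketch (again only an illustration, with names I made up): one code point, U+00C5, and the different byte sequences it turns into under a few encodings.

```python
char = "\u00c5"  # Å

# The code point is a number; it does not depend on any encoding.
code_point = ord(char)

# The encoded form is a byte sequence, and it differs per encoding.
encoded = {
    name: char.encode(name)
    for name in ("utf-8", "utf-16-be", "utf-16-le", "latin-1")
}

for name, data in encoded.items():
    print(f"U+{code_point:04X} in {name}: {data.hex(' ')}")
```

The number 0xC5 only shows up as a single byte in Latin-1; in UTF-8 the same code point is two bytes.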

> and "a *text element* ...
> is represented by a sequence of one or more codepoints".  (And the
> term "text element" is deliberately left undefined in the Unicode
> standard)
> 
> Individual codepoints are very often *not* the encoded form of an
> abstract character that we are interested in.  Unless we are
> communicating to or from another system  (Which in some cases is the
> Smalltalk ByteString class)


> 
> i.e. in other words
> 
> *Some* individual codepoints *may* be a representation of a specific
> *abstract character*, but only in special cases.
> 
> The general case in Unicode is that Unicode defines (a)
> representation(s) of a Unicode *abstract character*.
> 
> The Unicode standard representation of an abstract character is a
> composed sequence of codepoints, where in some cases that sequence is
> as short as 1 codepoint.
> 
> In other cases, Unicode has a compatibility alias of a single
> codepoint which is *also* a representation of an abstract character
> 
> There are some cases where an abstract character can be represented by
> more than one single-codepoint compatibility codepoint.
> 
> Cheers,
>  Euan

I agree you have a good grasp of the distinction between an abstract character 
(characters and character sequences which should be treated as equivalent wrt. 
equality / sorting / display, etc.) and a character (each of which has a code 
point assigned).
That is beside the point both Sven and I tried to get across, which is the 
difference between a code point and the encoded form(s) of said code point.
When you write:
"and therefore encodable in UTF-8 as compatibility codepoint e9 hex
and as the composed character #(0065 00b4) (all in hex) and as the
same composed character as both
#(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are included"

It's quite clear you confuse the two. 0xFEFF is the code point of the character 
used as BOM.
When you state that it can be written ffef (I assume you meant FFFE), you are 
again confusing the code point and its encoded value (an encoded value which 
only occurs in UTF-16/32, no less).
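
A short Python sketch of the same point (illustrative only; the variable names are mine): the BOM is always the code point U+FEFF, and the FF FE byte order is just what that code point looks like once encoded as little-endian UTF-16 or UTF-32. The same goes for U+00E9, which is not the single byte E9 in UTF-8.

```python
bom = "\ufeff"

# There is exactly one code point for the BOM character.
bom_code_point = ord(bom)

# Its byte representation depends on the encoding and byte order.
bom_bytes = {
    enc: bom.encode(enc)
    for enc in ("utf-8", "utf-16-be", "utf-16-le", "utf-32-be", "utf-32-le")
}

for enc, data in bom_bytes.items():
    print(f"U+{bom_code_point:04X} in {enc}: {data.hex(' ')}")

# Decoding the little-endian bytes FF FE gives back U+FEFF, not "U+FFFE".
decoded = b"\xff\xfe".decode("utf-16-le")

# U+00E9 (é) is two bytes in UTF-8, not the single byte E9.
e_acute_utf8 = "\u00e9".encode("utf-8")
print(e_acute_utf8.hex(" "))
```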

When this distinction is clear, it might be easier to see the value in keeping 
Strings as arrays of Unicode code points, and converting to encoded forms 
when entering/exiting the system.
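
As a rough sketch of that design, here it is in Python, whose str type happens to behave as a sequence of code points. This is an analogy, not how Pharo implements it, and read_text / write_text are hypothetical helper names I made up for the example.

```python
def read_text(raw: bytes, encoding: str) -> str:
    """Boundary on the way in: bytes in a known encoding become code points."""
    return raw.decode(encoding)


def write_text(text: str, encoding: str) -> bytes:
    """Boundary on the way out: code points become bytes in a chosen encoding."""
    return text.encode(encoding)


incoming = b"\xc3\x85ngstr\xc3\xb6m"  # "Ångström" as UTF-8 bytes

# Inside the system we only deal with code points.
text = read_text(incoming, "utf-8")
code_points = [ord(c) for c in text]

# The encoding is chosen again only when the text leaves the system.
outgoing = write_text(text, "utf-16-le")

print(len(incoming), len(text), len(outgoing))
print([f"{cp:04X}" for cp in code_points])
```

The string has the same length and contents no matter which encoding it arrived in or leaves in; only the byte counts at the boundaries change.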

Cheers,
Henry
