> On 8 Dec 2015, at 10:07 p.m., EuanM <[email protected]> wrote:
>
> "No. a codepoint is the numerical value assigned to a character. An
> "encoded character" is the way a codepoint is represented in bytes
> using a given encoding."
>
> No.
>
> A codepoint may represent a component part of an abstract character,
> or may represent an abstract character, or it may do both (but not
> always at the same time).
>
> Codepoints represent a single encoding of a single concept.
>
> Sometimes that concept represents a whole abstract character.
> Sometimes it represents part of an abstract character.

Well, I do not agree with this. I agree with the quote. Can you explain a bit more about what you mean by "abstract character" and "concept"?

>
> This is the key difference between Unicode and most character encodings.
>
> A codepoint does not always represent a whole character.
>
> On 7 December 2015 at 13:06, Henrik Johansen
> <[email protected]> wrote:
>>
>> On 07 Dec 2015, at 1:05 , EuanM <[email protected]> wrote:
>>
>> Hi Henry,
>>
>> To be honest, at some point I'm going to long for the much
>> more succinct semantics of healthcare systems and sports scoring and
>> administration systems again. :-)
>>
>> codepoints are any of *either*
>> - the representation of a component of an abstract character, *or*
>> eg. "A" #(0041) as a component of
>> - the sole representation of the whole of an abstract character *or* of
>> - a representation of an abstract character provided for backwards
>> compatibility which is more properly represented by a series of
>> codepoints representing a composed character
>>
>> e.g.
>>
>> The "A" #(0041) as a codepoint can be:
>> the sole representation of the whole of an abstract character "A" #(0041)
>>
>> The representation of a component of the composed (i.e. preferred)
>> version of the abstract character Å #(0041 030a)
>>
>> Å (#00C5) represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>>
>> Å (#212b) also represents one valid compatibility form of the abstract
>> character Å which is most properly represented by #(0041 030a).
>>
>> With any luck, this satisfies both our semantic understandings of the
>> concept of "codepoint"
>>
>> Would you agree with that?
>>
>> In Unicode, codepoints are *NOT* an abstract numerical representation
>> of a text character.
>>
>> At least not as we generally understand the term "text character" from
>> our experience of non-Unicode character mappings.
>>
>> I agree, they are numerical representations of what Unicode refers to as
>> characters.
>>
>> codepoints represent "*encoded characters*"
>>
>> No. a codepoint is the numerical value assigned to a character. An "encoded
>> character" is the way a codepoint is represented in bytes using a given
>> encoding.
>>
>> and "a *text element* ...
>> is represented by a sequence of one or more codepoints". (And the
>> term "text element" is deliberately left undefined in the Unicode
>> standard)
>>
>> Individual codepoints are very often *not* the encoded form of an
>> abstract character that we are interested in. Unless we are
>> communicating to or from another system (Which in some cases is the
>> Smalltalk ByteString class)
>>
>> i.e. in other words
>>
>> *Some* individual codepoints *may* be a representation of a specific
>> *abstract character*, but only in special cases.
>>
>> The general case in Unicode is that Unicode defines (a)
>> representation(s) of a Unicode *abstract character*.
>>
>> The Unicode standard representation of an abstract character is a
>> composed sequence of codepoints, where in some cases that sequence is
>> as short as 1 codepoint.
>>
>> In other cases, Unicode has a compatibility alias of a single
>> codepoint which is *also* a representation of an abstract character
>>
>> There are some cases where an abstract character can be represented by
>> more than one single-codepoint compatibility codepoint.
>>
>> Cheers,
>> Euan
>>
>> I agree you have a good grasp of the distinction between an abstract
>> character (characters and character sequences which should be treated
>> as equivalent wrt. equality / sorting / display, etc.) and a character
>> (each of which has a code point assigned).
>> That is beside the point both Sven and I tried to get through, which is the
>> difference between a code point and the encoded form(s) of said code point.
>> When you write:
>> "and therefore encodable in UTF-8 as compatibility codepoint e9 hex
>> and as the composed character #(0065 00b4) (all in hex) and as the
>> same composed character as both
>> #(feff 0065 00b4) and #(ffef 0065 00b4) when endianness markers are
>> included"
>>
>> It's quite clear you confuse the two. 0xFEFF is the codepoint of the
>> character used as the BOM.
>> When you state that it can be written ffef (I assume you meant FFFE), you
>> are again confusing the code point and its encoded value (an encoded value
>> which only occurs in UTF16/32, no less).
>>
>> When this distinction is clear, it might be easier to see the value in
>> keeping Strings as arrays of Unicode code points, and converting them to
>> encoded forms when entering/exiting the system.
>>
>> Cheers,
>> Henry
>>
>
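
The two distinctions argued over in this thread (abstract character vs. code point sequence, and code point vs. encoded bytes) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses Python rather than Smalltalk, the helper names code_points and encoded_forms are made up for the example, and the sample characters are the Å forms and the U+FEFF BOM code point quoted above.

```python
import unicodedata


def code_points(text):
    """Return the code points of text as uppercase hex strings (hypothetical helper)."""
    return [f"{ord(ch):04X}" for ch in text]


def encoded_forms(text, encodings=("utf-8", "utf-16-be", "utf-16-le")):
    """Map each encoding name to the hex bytes text becomes under it (hypothetical helper)."""
    return {name: text.encode(name).hex() for name in encodings}


# Three code point sequences for the one abstract character Å.
a_ring = "\u00C5"             # LATIN CAPITAL LETTER A WITH RING ABOVE
angstrom = "\u212B"           # ANGSTROM SIGN
a_plus_ring = "\u0041\u030A"  # A followed by COMBINING RING ABOVE

# As code point sequences these are three different strings ...
distinct = {a_ring, angstrom, a_plus_ring}

# ... but each normalization form maps all three to a single sequence.
nfc_forms = {unicodedata.normalize("NFC", s) for s in distinct}
nfd_forms = {unicodedata.normalize("NFD", s) for s in distinct}

# One code point, U+FEFF, and its byte-level encoded forms.
bom_forms = encoded_forms("\uFEFF")

print("distinct sequences:", sorted(code_points(s) for s in distinct))
print("NFC:", [code_points(s) for s in nfc_forms])
print("NFD:", [code_points(s) for s in nfd_forms])
print("U+FEFF encoded:", bom_forms)
```

The byte order FF FE shows up only in the UTF-16 little-endian encoded form of U+FEFF; the code point itself is always FEFF, and its UTF-8 form is the three bytes EF BB BF. Keeping strings as code point sequences internally and encoding only at the system boundary, as Henry suggests, keeps these two layers apart.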
