On Wednesday, 1 June 2016 at 14:29:58 UTC, Andrei Alexandrescu wrote:
> On 06/01/2016 06:25 AM, Marc Schütz wrote:
>> On Tuesday, 31 May 2016 at 21:01:17 UTC, Andrei Alexandrescu wrote:
>>> The point is to operate on representation-independent entities
>>> (Unicode code points) instead of low-level representation-specific
>>> artifacts (code units).
>>
>> _Both_ are low-level representation-specific artifacts.
>
> Maybe this is a misunderstanding. Representation = how things
> are laid out in memory. What does associating numbers with
> various Unicode symbols have to do with representation? --
Ok, if you define it that way, sure. I was thinking in terms of
the actual text: Unicode provides a variety of low-level
representations of that text: UTF8/NFC, UTF8/NFD, unnormalized
UTF8, UTF16 big/little endian x normalization, UTF32 x
normalization, and some other more obscure ones. From that
viewpoint, auto-decoded char[] (= UTF8) is equivalent to dchar[]
(= UTF32). Neither of them is the actual text.
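For what it's worth, the normalization point is easy to demonstrate outside of D; a small Python sketch (using only the standard unicodedata module) shows the same text "é" living in two canonical-equivalent representations:

```python
import unicodedata

# The same text ("é") in two canonical-equivalent forms:
nfc = unicodedata.normalize("NFC", "\u00e9")  # precomposed: U+00E9
nfd = unicodedata.normalize("NFD", "\u00e9")  # decomposed: 'e' + U+0301

# Different code point counts...
print(len(nfc), len(nfd))                          # 1 2
# ...and different UTF8 code unit sequences...
print(nfc.encode("utf-8"), nfd.encode("utf-8"))    # b'\xc3\xa9' b'e\xcc\x81'
# ...yet both denote the same text: equal after canonical normalization.
print(unicodedata.normalize("NFC", nfd) == nfc)    # True
```

So even at the code point level (what auto-decoding gives you), two strings spelling the same text need not compare equal.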
Both writing and the memory representation consist of fundamental
units. But there is no 1:1 relationship between the units of
char[] (UTF8 code units) or auto-decoded strings (Unicode code
points) on the one hand, and the units of writing (graphemes) on
the other.
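To make the mismatch concrete: a sketch in Python (proper grapheme segmentation per UAX #29 needs a library, or std.uni.byGrapheme in D, but the unit counts alone show there is no 1:1 mapping):

```python
# A single grapheme: "é" written as base 'e' + combining acute U+0301.
g = "e\u0301"
print(len(g))                  # 2 code points, not 1
print(len(g.encode("utf-8")))  # 3 UTF8 code units, not 1

# Operating on code points still breaks graphemes: reversing "abcé"
# code-point-wise detaches the combining accent from its base 'e'.
print("abce\u0301"[::-1] == "\u0301ecba")  # True
```

Auto-decoding pays the cost of decoding to code points without reaching the unit that actually corresponds to what the user sees.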