On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:
On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
- First, is there any difference between string, wstring and
dstring?
Yes, they encode the same content differently in the bytes. If
you cast it to ubyte[] and print that out you can see the
difference.
- Are the characters of a string stored in memory by their
Unicode codepoint(s), as opposed to some other encoding?
No, they are encoded in UTF-8, UTF-16, or UTF-32 for string,
wstring, and dstring respectively.
- Can a series of codepoints, appropriately padded to the
required width, and terminated by a null character, be
directly assigned to a string WITHOUT GOING THROUGH A DECODING
/ ENCODING TRANSLATION?
No, they must be encoded. Unicode code points are an abstract
concept that must be encoded somehow to exist in memory
(similar to the idea of a number).
OK, then that actually simplifies what's needed, because I won't
need to decode the UTF-8, only validate it.
My code reads a UTF-8 encoded file into a buffer and validates
the encoding byte by byte, along with some additional checks.
If I simply return the UTF-8 encoded string, there won't be
another decoding/encoding done -- correct?
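
To make the question concrete, here is a minimal sketch of what I have in mind. The helper name toValidatedString is hypothetical, and std.utf.validate stands in for my own byte-by-byte validator; the assumption being tested is that retyping an immutable byte buffer as a string leaves the bytes exactly as they were.

```d
import std.string : representation;
import std.utf : validate;

/// Hypothetical helper: checks that `buf` holds well-formed UTF-8 and
/// returns the same memory typed as a string.
string toValidatedString(immutable(ubyte)[] buf)
{
    auto s = cast(string) buf; // reinterprets the element type only
    validate(s);               // throws UTFException on malformed input
    return s;
}

void main()
{
    // "h", U+00E9 (two bytes in UTF-8), "!"
    immutable(ubyte)[] raw = [0x68, 0xC3, 0xA9, 0x21];
    string s = toValidatedString(raw);

    // Same memory and same bytes: nothing was copied or re-encoded.
    assert(cast(const(void)*) s.ptr is cast(const(void)*) raw.ptr);
    assert(s.representation == raw);

    // .length counts code units, so the three encodings differ here.
    assert(s.length == 4);            // UTF-8: 4 bytes
    assert("h\u00E9!"w.length == 3);  // UTF-16: 3 code units
    assert("h\u00E9!"d.length == 3);  // UTF-32: 3 code units
}
```

If the buffer is a mutable ubyte[] instead, std.exception.assumeUnique should give the same result, provided no other mutable reference to the buffer is kept afterwards.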