On Monday, 22 June 2020 at 03:24:37 UTC, Adam D. Ruppe wrote:
On Monday, 22 June 2020 at 03:17:54 UTC, Denis wrote:
- First, is there any difference between string, wstring and dstring?

Yes, they encode the same content differently in the bytes. If you cast it to ubyte[] and print that out you can see the difference.
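To make that cast concrete, here is a minimal sketch. The helper name `byteLength` is made up for illustration; the rest is plain casting of the array to bytes, as suggested above.

```d
import std.stdio : writeln;

// Hypothetical helper: number of bytes the text occupies in memory.
size_t byteLength(T)(const(T)[] text)
{
    return (cast(const(ubyte)[]) text).length;
}

void main()
{
    string  s = "\u00E9";  // U+00E9 as UTF-8: two 1-byte code units
    wstring w = "\u00E9"w; // as UTF-16: one 2-byte code unit
    dstring d = "\u00E9"d; // as UTF-32: one 4-byte code unit

    writeln(cast(const(ubyte)[]) s); // [195, 169]
    writeln(cast(const(ubyte)[]) w); // two bytes, order depends on endianness
    writeln(cast(const(ubyte)[]) d); // four bytes, order depends on endianness

    writeln(byteLength(s), " ", byteLength(w), " ", byteLength(d)); // 2 2 4
}
```

The same character takes 2, 2, and 4 bytes here, but the byte values differ between the three types.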

- Are the characters of a string stored in memory by their Unicode codepoint(s), as opposed to some other encoding?

No, they are encoded in UTF-8, UTF-16, or UTF-32 for string, wstring, and dstring respectively.

- Can a series of codepoints, appropriately padded to the required width, and terminated by a null character, be directly assigned to a string WITHOUT GOING THROUGH A DECODING / ENCODING TRANSLATION?

No, they must be encoded. Unicode code points are an abstract concept that must be encoded somehow to exist in memory (similar to the idea of a number).
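As a rough illustration of that encoding step, the sketch below turns a single code point into its UTF-8 bytes with `std.utf.encode`. The wrapper `utf8Bytes` is a hypothetical name, not something from Phobos.

```d
import std.stdio : writeln;
import std.utf : encode;

// Hypothetical helper: encode one code point as UTF-8 and return the bytes.
ubyte[] utf8Bytes(dchar codePoint)
{
    char[4] buf;                          // UTF-8 needs at most 4 code units
    immutable n = encode(buf, codePoint); // returns the number of units written
    return cast(ubyte[]) buf[0 .. n].dup;
}

void main()
{
    writeln(utf8Bytes('A'));      // [65]
    writeln(utf8Bytes('\u20AC')); // [226, 130, 172], i.e. E2 82 AC for U+20AC
}
```

The number 0x20AC never appears in the output: a string holds the encoded bytes, not the code point value.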

OK, then that actually simplifies what's needed, because I won't need to decode the UTF-8, only validate it.

My code reads a UTF-8 encoded file into a buffer and validates the UTF-8 encoding byte by byte, along with some additional checks. If I simply return the UTF-8 encoded string, no further decoding or encoding takes place, correct?
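For reference, a minimal sketch of what I mean, assuming the buffer is a `ubyte[]` that nothing else holds on to. The name `asValidatedString` is hypothetical, and `std.utf.validate` stands in here for my own byte-by-byte checks.

```d
import std.exception : assumeUnique;
import std.stdio : writeln;
import std.utf : UTFException, validate;

// Hypothetical helper: check that the buffer is well-formed UTF-8,
// then hand back the same memory typed as a string.
string asValidatedString(ubyte[] buffer)
{
    auto text = cast(char[]) buffer; // reinterprets the bytes, no copy
    validate(text);                  // throws UTFException on malformed input
    return assumeUnique(text);       // assumes no mutable reference is kept
}

void main()
{
    ubyte[] buffer = [0x44, 0xC3, 0xA9]; // "D" followed by U+00E9 in UTF-8
    auto original = buffer.ptr;

    string s = asValidatedString(buffer);
    writeln(s.length);                              // 3: length counts code units
    writeln(cast(const(ubyte)*) s.ptr == original); // true: same memory

    try
    {
        asValidatedString([0xC3]); // truncated two-byte sequence
    }
    catch (UTFException e)
    {
        writeln("invalid UTF-8");
    }
}
```

As far as I understand it, indexing and slicing the returned string work on code units and do not decode; decoding only comes back in if I iterate with `foreach (dchar c; s)` or go through the range functions that auto-decode.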
