On Wednesday, 20 November 2013 at 00:01:00 UTC, Andrei Alexandrescu wrote:
> (c) A variety of text functions currently suffer because we don't make the difference between validated UTF strings and potentially invalid ones.

I think it is fair to always assume that a char[] is a valid UTF-8 string, and instead perform the validation when creating/filling the string from a non-validated source.

Take std.file.read() as an example; it returns void[], but has a validating counterpart in std.file.readText().
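To make the difference concrete, here is a small sketch of the two calls applied to the same malformed file. The file name and the helper writeSample are made up for the illustration; read(), readText(), write() and remove() are the std.file functions.

```d
import std.exception : assertThrown;
import std.file : read, readText, remove, write;
import std.utf : UTFException;

// Hypothetical helper: writes three bytes, the last of which (0xFF)
// can never occur in well-formed UTF-8.
void writeSample(string path)
{
    ubyte[] bytes = [0x68, 0x69, 0xFF];
    write(path, bytes);
}

void main()
{
    // Scratch file name chosen for illustration only.
    enum path = "invalid_utf8_example.bin";
    writeSample(path);
    scope (exit) remove(path);

    // read() performs no validation and returns untyped bytes.
    void[] raw = read(path);
    assert(raw.length == 3);

    // readText() validates the contents and throws on malformed input.
    assertThrown!UTFException(readText(path));
}
```

So the validation happens exactly once, at the point where the bytes become a string.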

I think we should use ubyte[] to a greater extent for data which is potentially *not* valid UTF. Examples include interfacing with C functions, where I think there is a tendency towards always translating C char to D char, when they are in fact not equivalent: a C char is just a byte with no guaranteed encoding, whereas a D char is meant to be a UTF-8 code unit. Another example is, again, std.file.read(), which currently returns void[]. I guess it is a matter of taste, but I think ubyte[] would be more appropriate here, since you can actually use it for something without casting it first.
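As a rough sketch of what I mean for the C case: the helper cBytes below is hypothetical (it is not in Phobos) and simply views a NUL-terminated C string as raw bytes, leaving it to the caller to validate before treating the data as a D string.

```d
import core.stdc.string : strlen;
import std.exception : assertThrown;
import std.utf : UTFException, validate;

// Hypothetical helper: view a NUL-terminated C string as raw bytes.
// Nothing is copied and no encoding is assumed.
const(ubyte)[] cBytes(const(char)* p)
{
    if (p is null)
        return null;
    return (cast(const(ubyte)*) p)[0 .. strlen(p)];
}

void main()
{
    // 'c' followed by Latin-1 'é' (0xE9) and a NUL terminator:
    // a perfectly legal C string, but not well-formed UTF-8.
    immutable(ubyte)[] latin1 = [0x63, 0xE9, 0x00];
    auto bytes = cBytes(cast(const(char)*) latin1.ptr);
    assert(bytes.length == 2);

    // Only an explicit, validated conversion should yield a char[].
    assertThrown!UTFException(validate(cast(const(char)[]) bytes));
}
```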

The transition from string to ubyte[] is already made simple by std.string.representation. We should offer an equally simple and convenient way to do the opposite transformation. In one of my current projects, I am using this function:

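  inout(char)[] asString(inout(ubyte)[] data) @safe pure
  {
    import std.utf: validate;
    // Reinterpret the bytes as UTF-8 code units; no copy is made.
    auto s = cast(typeof(return)) data;
    // Throws std.utf.UTFException if the data is not well-formed UTF-8.
    validate(s);
    return s;
  }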

This could easily be written as a template, to accept wider encodings as well, and I think it would be a nice addition to Phobos.
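A minimal sketch of what such a template might look like follows. The name CharTypeOf is made up for this example, and attributes are left to inference rather than spelled out; treat it as one possible shape, not a finished Phobos proposal.

```d
import std.exception : assertThrown;
import std.string : representation;
import std.traits : CopyTypeQualifiers, Unqual;
import std.utf : UTFException, validate;

// Hypothetical mapping from an unsigned code-unit type to the character
// type of the same width, preserving const/immutable qualifiers.
template CharTypeOf(T)
{
    static if (is(Unqual!T == ubyte))
        alias CharTypeOf = CopyTypeQualifiers!(T, char);
    else static if (is(Unqual!T == ushort))
        alias CharTypeOf = CopyTypeQualifiers!(T, wchar);
    else static if (is(Unqual!T == uint))
        alias CharTypeOf = CopyTypeQualifiers!(T, dchar);
    else
        static assert(0, "asString: unsupported code unit type " ~ T.stringof);
}

// Reinterprets the data as UTF-8/16/32 code units and validates it.
// Throws std.utf.UTFException on malformed input; no copy is made.
auto asString(T)(T[] data)
{
    alias C = CharTypeOf!T;
    auto s = cast(C[]) data;
    validate(s);
    return s;
}

void main()
{
    // Round trip: string -> immutable(ubyte)[] -> string.
    assert(asString("héllo".representation) == "héllo");

    // UTF-16 code units work the same way.
    ushort[] w = [0x0068, 0x0069];
    assert(asString(w) == "hi"w);

    // Malformed UTF-8 is rejected instead of silently becoming a char[].
    ubyte[] bad = [0xFF];
    assertThrown!UTFException(asString(bad));
}
```

The return type follows the qualifiers of the input, so an immutable(ubyte)[] comes back as a string and a mutable ubyte[] as a char[].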

Lars
