Re: Thin UTF8 string wrapper

Jonathan M Davis via Digitalmars-d-learn Fri, 06 Dec 2019 19:24:48 -0800

On Friday, December 6, 2019 9:48:21 AM MST Joseph Rushton Wakeling via 
Digitalmars-d-learn wrote:
> Hello folks,
>
> I have a use-case that involves wanting to create a thin struct
> wrapper of underlying string data (the idea is to have a type
> that guarantees that the string has certain desirable properties).
>
> The string is required to be valid UTF-8.  The question is what
> the most useful API is to expose from the wrapper: a sliceable
> random-access range?  A getter plus `alias this` to just treat it
> like a normal string from the reader's point of view?
>
> One factor that I'm not sure how to address w.r.t. a full range
> API is how to handle iterating over elements: presumably they
> should be iterated over as `dchar`, but how to implement a
> `front` given that `std.encoding` gives no way to decode the
> initial element of the string that doesn't also pop it off the
> front?
>
> I'm also slightly disturbed to see that `std.encoding.codePoints`
> requires `immutable(char)[]` input: surely it should operate on
> any range of `char`?
>
> I'm inclining towards the "getter + `alias this`" approach, but I
> thought I'd throw the problem out here to see if anyone has any
> good experience and/or advice.
>
> Thanks in advance for any thoughts!


The module to look at here is std.utf, not std.encoding. decode and
decodeFront can be used to get a code point if that's what you want, whereas
byCodeUnit and byUTF can be used to get a range over code units or code
points. There's also byCodePoint and byGrapheme in std.uni. std.encoding is
old and arguably needs an overhaul. I don't think that I've ever done
anything with it other than for dealing with BOMs.
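
To make the difference between decode and decodeFront concrete, here is a minimal sketch. The helper name peekFront is made up for illustration, and the unittest can be run with `dmd -unittest -main -run file.d`. It shows that decode only advances an index, so the first code point can be read without popping it, which is what a `front` needs.

```d
import std.range.primitives : front;
import std.utf : decode, decodeFront;

/// Hypothetical helper: first code point of a non-empty string, nothing popped.
dchar peekFront(string s)
{
    size_t index = 0;          // decode advances this index, not the string
    return decode(s, index);
}

unittest
{
    string s = "h\u00E9llo";   // U+00E9 takes two UTF-8 code units

    // decode reads the code point at index i and moves i past it.
    size_t i = 0;
    assert(decode(s, i) == 'h' && i == 1);
    assert(decode(s, i) == '\u00E9' && i == 3);
    assert(s.length == 6);     // s itself is untouched

    // decodeFront pops the code units that it decodes.
    string rest = s;
    assert(decodeFront(rest) == 'h');
    assert(rest == "\u00E9llo");

    // For a plain string, front from std.range.primitives also decodes
    // the first code point without popping it.
    assert(peekFront(s) == 'h');
    assert(s.front == 'h');
}
```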

If you provide a range of UTF-8 code units, then it will just work with any
code that's written to work with a range of any character type, whereas if
you specifically need to have it be a range of code points or graphemes,
then using the wrappers from std.utf or std.uni will get you that. And there
really isn't any reason to restrict the operations on a range of char the
way that std.range.primitives does for string. If you're dealing with a
function that was specifically written to operate on any range of
characters, then it's unnecessary, and if it's just a normal range-based
function which isn't specialized for ranges of characters, then it's going
to iterate over whatever the element type of the range is. So, you'll need
to use a wrapper like byUTF, byCodePoint, or byGrapheme to get whatever the
correct behavior is depending on what you're trying to do.
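
As a rough illustration of how the choice of wrapper changes what generic code sees, the sketch below checks the element type of each range and counts the elements of the same string three ways. The names isRangeOf and counts are hypothetical helpers, not part of Phobos.

```d
import std.range.primitives : ElementType, walkLength;
import std.traits : Unqual;
import std.uni : byGrapheme, Grapheme;
import std.utf : byCodeUnit, byUTF;

/// Hypothetical helper: true if R's element type is T, ignoring qualifiers.
enum isRangeOf(R, T) = is(Unqual!(ElementType!R) == T);

static assert(isRangeOf!(string, dchar));                   // autodecoding
static assert(isRangeOf!(typeof("".byCodeUnit), char));     // code units
static assert(isRangeOf!(typeof("".byUTF!dchar), dchar));   // code points
static assert(isRangeOf!(typeof("".byGrapheme), Grapheme)); // graphemes

/// Hypothetical helper: [code units, code points, graphemes] in s.
size_t[3] counts(string s)
{
    return [s.byCodeUnit.walkLength,
            s.byUTF!dchar.walkLength,
            s.byGrapheme.walkLength];
}

unittest
{
    // 'e' followed by U+0301 (combining acute accent):
    // three code units, two code points, one grapheme.
    auto c = counts("e\u0301");
    assert(c[0] == 3 && c[1] == 2 && c[2] == 1);
}
```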

The main hiccup is that a lot of Phobos is basically written with the idea
that ranges of characters will be ranges of dchar. Some of Phobos has been
fixed so that it no longer assumes that, but plenty of it hasn't been.
However, what that usually means is that the code just operates on the
element type and
special-cases for narrow strings, or it's specifically written to operate on
ranges of dchar. For cases like that, byUTF!dchar or byCodePoint will likely
work; alternatively, you can provide a way to access the underlying string
and just have them operate directly on the string, but depending on what
you're trying to do with your wrapper, exposing the underlying string may or
may not be a problem (given that string has immutable elements though, it's
probably fine so long as you don't provide a reference to the string
itself).
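
A small sketch of the byUTF!dchar route follows. countUpper is a hypothetical stand-in for a function that only accepts ranges of dchar; it is not a Phobos function.

```d
import std.range.primitives : ElementType, isInputRange;
import std.traits : Unqual;
import std.uni : isUpper;
import std.utf : byCodeUnit, byUTF;

/// Hypothetical function written only for ranges of dchar.
size_t countUpper(R)(R r)
if (isInputRange!R && is(Unqual!(ElementType!R) == dchar))
{
    size_t n;
    foreach (dchar c; r)
    {
        if (isUpper(c))
            ++n;
    }
    return n;
}

unittest
{
    auto units = "\u00C9cole Normale".byCodeUnit;   // range of char

    // The template constraint rejects a range of code units...
    static assert(!__traits(compiles, countUpper(units)));

    // ...but wrapping it in byUTF!dchar yields a range of code points.
    assert(countUpper(units.byUTF!dchar) == 2);

    // A plain string is accepted as-is, since Phobos treats it as a
    // range of dchar.
    assert(countUpper("\u00C9cole Normale") == 2);
}
```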

In general, I'd strongly advise against using alias this with range-based
code (or really, with generic code at all). Depending on the code, it _can_
work, but it's also an easy source of bugs. Unless the code forces the
conversion, what you can easily get is some of the code operating directly
on the wrapper type and some of it doing the implicit conversion and
operating on the underlying type. Best case, that results in compilation
errors, but it could also result in subtle bugs. It's far less error-prone
to require that the conversion be done explicitly.
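
One way this kind of split can show up is sketched below, with a made-up wrapper (AliasedUtf8) and a made-up generic function (pathTaken). Member access and slicing quietly go through the implicit conversion, while the template is instantiated with the wrapper type and takes a different branch.

```d
/// Hypothetical wrapper that forwards to its string via alias this.
struct AliasedUtf8
{
    string data;
    alias data this;
}

/// Hypothetical generic function that special-cases string.
string pathTaken(T)(T value)
{
    static if (is(T == string))
        return "string";
    else
        return "wrapper";
}

unittest
{
    auto w = AliasedUtf8("abc");

    // These compile through the implicit conversion to string, and the
    // slice is a plain string, so the wrapper is already lost here.
    assert(w.length == 3);
    static assert(is(typeof(w[0 .. 2]) == string));

    // The template sees the wrapper type, not string.
    assert(pathTaken(w) == "wrapper");

    // Only an explicit conversion reaches the string branch.
    string s = w;
    assert(pathTaken(s) == "string");
}
```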

So, if all you're really trying to do is provide some guarantees about how
the string was constructed but then are looking to essentially just have it
be a string after that, it would probably be simplest to make it so that
your wrapper type doesn't have much in the way of operations and that it
just provides a property to access the underlying string. Then the type
itself isn't a range, and any code that wants to operate on the data can
just use the property to get the underlying string and use it as a string
after that. That approach basically completely sidesteps the issue of how to
treat the data as a range, since you get the normal behavior for strings for
any code that does much more than just pass around the data. You _do_ lose
the knowledge that the wrapper type gave you about the state of the string
once you start actually operating on the data, but once you start operating
on it, that knowledge is probably no longer valid anyway (especially if
you're passing it to a function which is going to return a wrapper range
that transforms the elements, rather than to something like find, which
just looks at the range).
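
A minimal sketch of that approach might look like the following. The type name ValidUtf8 and the property name value are placeholders, and it assumes that std.utf.validate is an acceptable check for the guarantee that the wrapper is meant to provide.

```d
import std.exception : assertThrown;
import std.range.primitives : isInputRange, walkLength;
import std.utf : UTFException, validate;

/// Hypothetical wrapper: checked on construction, read-only afterwards.
struct ValidUtf8
{
    private string _data;

    this(string s)
    {
        validate(s);   // throws UTFException if s is not valid UTF-8
        _data = s;
    }

    /// Explicit access to the underlying string.
    @property string value() const
    {
        return _data;
    }
}

// The wrapper itself is deliberately not a range.
static assert(!isInputRange!ValidUtf8);

unittest
{
    auto v = ValidUtf8("na\u00EFve");
    assert(v.value.length == 6);       // code units
    assert(v.value.walkLength == 5);   // code points, normal string behavior

    // Invalid UTF-8 is rejected at construction time.
    immutable(ubyte)[] raw = [0xFF, 0xFE];
    assertThrown!UTFException(ValidUtf8(cast(string) raw));
}
```

Note that ValidUtf8.init bypasses the constructor, but it holds an empty string, which is still valid UTF-8.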

- Jonathan M Davis
