On Wednesday, 18 July 2018 at 22:44:33 UTC, aliak wrote:
On Wednesday, 18 July 2018 at 12:10:04 UTC, Seb wrote:
On Wednesday, 18 July 2018 at 03:40:08 UTC, Jon Degenhardt wrote:
[...]
[...]
That point is still open for discussion, but at the moment
rcstring isn't a range and the user has to declare what kind
of range he/she wants with e.g. `.by!char`
However, one current idea is that for some use cases (e.g.
comparison) it might not matter and an application could add
overloads for rcstrings.
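To make the `.by!char` idea concrete, here is a toy model of the proposed interface (this is not the actual rcstring implementation — `ToyRcString` is a hypothetical stand-in): the string itself is deliberately not a range, so the caller must pick the iteration unit explicitly.

```d
import std.algorithm.searching : count;
import std.utf : byUTF;

// Toy sketch: the string is not a range; the caller chooses the
// iteration unit explicitly via .by!T (char, wchar, or dchar).
struct ToyRcString
{
    private string data;
    auto by(T)() const { return data.byUTF!T; }
}

void main()
{
    auto s = ToyRcString("höhe");
    // "ö" is two UTF-8 code units but a single code point:
    assert(s.by!char.count == 5);
    assert(s.by!dchar.count == 4);
}
```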
Maybe I misunderstood, but do you mean that it's only for
comparisons that the encoding doesn't matter? Even so, that does
not preclude normalization: Unicode defines U+00F1 as canonically
equal to the sequence U+006E U+0303, and comparison only works if
both sides are normalized (from what I understand at least),
regardless of whether you compare chars/wchars/dchars.
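The canonical-equivalence point above can be demonstrated with Phobos' `std.uni.normalize`: the two spellings of "ñ" compare unequal as raw strings, but equal once both are normalized to NFC.

```d
import std.uni : normalize, NFC;

void main()
{
    string composed   = "\u00F1";   // ñ as a single precomposed code point
    string decomposed = "n\u0303";  // 'n' followed by U+0303 COMBINING TILDE
    assert(composed != decomposed);                 // raw comparison fails
    assert(normalize!NFC(composed) == normalize!NFC(decomposed));
}
```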
The current idea is to do the same thing for Phobos - though I
have to say that I'm not really looking forward to adding 200
overloads to Phobos :/
[...]
That's the long-term goal of the collections project.
However, with rcstring being the first big use case for it,
the idea was to push rcstring forward and by that discover all
remaining issues with the Array class.
Also the interface of rcstring is rather contained (and
doesn't expose the underlying storage to the user), which
allows us to iterate over/improve upon the Array design.
[...]
Hehe, it's intended to solve both problems (auto-decoding by
default and @nogc) at the same time.
However, it looks to me like there isn't a good solution
to the auto-decoding problem that is both convenient for the
user and doesn't sacrifice performance.
How about a compile-time flag that can make things more
convenient:

auto str1 = latin1("literal");

rcstring!Latin1 latin1(string str) {
    return rcstring!Latin1(str);
}

auto str2 = utf8("åsm");
// ...
struct rcstring(Encoding = Unknown) {
    ubyte[] data;
    bool normalized = false;

    static if (is(Encoding == Latin1)) {
        // by-char range interface implementation
    } else static if (is(Encoding == Utf8)) {
        // byGrapheme range interface implementation?
    } else {
        // no range interface implementation
    }

    bool opEquals()(auto ref const rcstring rhs) const {
        static if (is(Encoding == Latin1)) {
            return data == rhs.data;  // raw bytes suffice for Latin-1
        } else {
            // normalize() would be a (hypothetical) helper that
            // returns the normalized form, cached via the flag above
            return normalize() == rhs.normalize();
        }
    }
}
And now most range algorithms will work correctly. Those
algorithms that don't need byGrapheme, but do need normalized
code points to work correctly, can normalize first - and that
seems like all the special handling you'd need inside range
algorithms?
Then:
readText("foo".latin1);
"ä".utf8.split.join("|");
??
Cheers,
- Ali
I like this approach; `rcstring.by!` is too verbose for my taste
and quite annoying for day-to-day usage.
I think rcstring should be aliased to concrete implementations
like ansi, utf8, utf16, utf32. Those aliases should be ranges and
maybe subtype their respective string, wstring, dstring so they
can be transparently used with non-range-based APIs (this
requires DIP 1000 for @safe).
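The "subtype their respective string" idea can be sketched with `alias this` (a hypothetical illustration, assuming a simple wrapper rather than the real rcstring, and ignoring the reference-counting and @safe/DIP 1000 aspects):

```d
// Hypothetical sketch: an encoding-specific wrapper that subtypes
// string via alias this, so it can be passed to plain-string APIs.
struct Utf8String
{
    string payload;
    alias payload this;  // implicit conversion to string
}

size_t plainStringApi(string s) { return s.length; }

void main()
{
    auto s = Utf8String("åsm");
    // "å" is 2 UTF-8 code units, so the byte length is 4:
    assert(plainStringApi(s) == 4);
}
```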
The takeaway is that rcstring by itself does not satisfy the
usability criteria, and should probably focus on performance and
flexibility, serving as a building block for higher-level
constructs that are easier to use and safer with regard to how
they handle the string type they hold.