On Thursday, September 6, 2018 3:15:59 PM MDT aliak via Digitalmars-d wrote:
> On Thursday, 6 September 2018 at 20:15:22 UTC, Jonathan M Davis wrote:
> > On Thursday, September 6, 2018 1:04:45 PM MDT aliak via Digitalmars-d wrote:
> >> D makes the code-point case default and hence that becomes the
> >> simplest to use. But unfortunately, the only thing I can think of
> >> that requires code point representations is when dealing
> >> specifically with unicode algorithms (normalization, etc). Here's
> >> a good read on code points:
> >> https://manishearth.github.io/blog/2017/01/14/stop-ascribing-meaning-to-unicode-code-points/ -
> >>
> >> tl;dr: application logic does not need or want to deal with
> >> code points. For speed units work, and for correctness,
> >> graphemes work.
> >
> > I think that it's pretty clear that code points are objectively
> > the worst level to be the default. Unfortunately, changing it
> > to _anything_ else is not going to be an easy feat at this
> > point. But if we can first ensure that Phobos in general
> > doesn't rely on it (i.e. in general, it can deal with ranges of
> > char, wchar, dchar, or graphemes correctly rather than assuming
> > that all ranges of characters are ranges of dchar), then maybe
> > we can figure something out. Unfortunately, while some work has
> > been done towards that, what's mostly happened is that folks
> > have complained about auto-decoding without doing much to
> > improve the current situation. There's a lot more to this than
> > simply ripping out auto-decoding even if every D user on the
> > planet agreed that outright breaking almost every existing D
> > program to get rid of auto-decoding was worth it. But as with
> > too many things around here, there's a lot more talking than
> > working. And actually, as such, I should probably stop
> > discussing this and go do something useful.
> >
> > - Jonathan M Davis
>
> Is there a unittest somewhere in phobos you know that one can be
> pointed to that shows the handling of these 4 variations you say
> should be dealt with first? Or maybe a PR that did some of this
> work that one could investigate?
>
> I ask so I can see in code what it means to make something not
> rely on autodecoding and deal with ranges of char, wchar, dchar
> or graphemes.
>
> Or a current "easy" bugzilla issue maybe that one could try a
> hand at?
Not really. The handling of this has generally been too ad-hoc. There are plenty of examples of handling different string types, and there are a few handling different ranges of character types, but there's a distinct lack of tests involving graphemes. And the correct behavior for each is going to depend on what exactly the function does - e.g. almost certainly, the correct thing for filter to do is to not do anything special for ranges of characters at all and just filter on the element type of the range (even though it would almost always be incorrect to filter a range of char unless it's known to be all ASCII), while on the other hand, find is clearly designed to handle different encodings. So, it needs to be able to find a dchar or grapheme in a range of char. And of course, there's the issue of how normalization should be handled (if at all).

A number of the tests in std.utf and std.string do a good job of testing Unicode strings of varying encodings, and std.utf does a good job overall of testing ranges of char, wchar, and dchar which aren't strings, but I'm not sure that anything in Phobos outside of std.uni currently does anything with ranges of graphemes. std.conv.to does have some tests for ranges of char, wchar, and dchar due to a bug fix, e.g.

// bugzilla 15800
@safe unittest
{
    import std.utf : byCodeUnit, byChar, byWchar, byDchar;

    assert(to!int(byCodeUnit("10")) == 10);
    assert(to!int(byCodeUnit("10"), 10) == 10);
    assert(to!int(byCodeUnit("10"w)) == 10);
    assert(to!int(byCodeUnit("10"w), 10) == 10);

    assert(to!int(byChar("10")) == 10);
    assert(to!int(byChar("10"), 10) == 10);
    assert(to!int(byWchar("10")) == 10);
    assert(to!int(byWchar("10"), 10) == 10);
    assert(to!int(byDchar("10")) == 10);
    assert(to!int(byDchar("10"), 10) == 10);
}

but there are no grapheme tests, and no Unicode characters are involved (though I'm not sure that much in std.conv really needs to worry about Unicode characters).

So, there are tests scattered all over the place which do pieces of what they need to be doing, but I'm not sure that there are currently any that test the full range of character ranges that they really need to be testing. As with testing reference type ranges, such tests have generally been added only when fixing a specific bug, and there hasn't been a sufficient effort to just go through all of the affected functions and add appropriate tests. And unfortunately, unlike with reference type ranges, the correct behavior of a function when faced with ranges of different character types is going to be highly dependent on what it does. Some functions shouldn't do anything special when processing ranges of characters at all; some shouldn't do anything special for arbitrary ranges of characters but still need to special-case strings because of the efficiency issues caused by auto-decoding; and yet others need to actually take Unicode into account and operate differently depending on whether they're given a range of code units, code points, or graphemes. So, completely aside from auto-decoding issues, it's a bit of a daunting task.

I keep meaning to take the time to work on it. I've done some of the critical work for supporting arbitrary ranges of char, wchar, and dchar rather than just string types (as have some other folks), but I haven't spent the time to start going through the functions one by one and adding the appropriate tests and fixes, and no one else has gone that far either.
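To make the difference between those three levels concrete, here's a quick illustration (not from Phobos' test suite, just counting the elements of the same string at each level). "é" spelled as 'e' plus U+0301 (combining acute) is three UTF-8 code units, two code points, and one grapheme, so the "length" of the string depends entirely on which view a function uses:

unittest
{
    import std.range : walkLength;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit;

    // "é" written as 'e' followed by U+0301 (combining acute accent)
    string s = "e\u0301";

    assert(s.byCodeUnit.walkLength == 3);  // 3 UTF-8 code units
    assert(s.walkLength == 2);             // 2 code points (auto-decoded today)
    assert(s.byGrapheme.walkLength == 1);  // 1 grapheme cluster
}

Any function that claims to handle all of those views has to give a sensible answer for each of them, which is exactly why the right behavior is so function-specific.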
So, I can't really point towards a specific set of tests and say "here, do what these do." And even if I could, whether what those tests do would be correct for another function would depend on what that function does. So, sorry that I can't be more helpful.

Actually, if you're looking for something related to this to do and don't feel that you know enough to just start adding tests, you could try byCodeUnit, byDchar, and byGrapheme with various functions and see what happens (a rough sketch of what I mean is below). If the function doesn't even compile (which will probably be the case at least some of the time), then that's an easy bug report. If the function does compile, then it will require a greater understanding to know whether it's doing the right thing, but in at least some cases, it may be obvious, and if the result is obviously wrong, you can create a bug report for that.

Ultimately though, a pretty solid understanding of ranges and Unicode is going to be required to write a lot of these tests. And worse, a pretty solid understanding of ranges and Unicode is going to be required to use any of these functions correctly even if they all work correctly and have all of the necessary tests to prove it. Unicode is just plain too complicated, and trying to make things "just work" with it is frequently difficult - especially if efficiency matters, but even when efficiency doesn't matter, it's not always obvious how to make it "just work." :(

- Jonathan M Davis
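A rough sketch of that kind of experiment (the function, the string, and the needle here are arbitrary examples; the commented-out cases are the ones whose behavior actually needs checking):

unittest
{
    import std.algorithm.searching : find;
    import std.range.primitives : empty;
    import std.uni : byGrapheme;
    import std.utf : byCodeUnit, byDchar;

    string haystack = "così";
    dchar needle = 'ì';

    // Auto-decoded string (treated as a range of dchar today): works.
    assert(!haystack.find(needle).empty);

    // Explicit range of code points: should behave the same way.
    assert(!haystack.byDchar.find(needle).empty);

    // Range of code units: does this compile, and if it does, does it
    // decode and find the non-ASCII needle, or silently compare raw
    // code units against a dchar? Try it and file a bug if it's wrong.
    //assert(!haystack.byCodeUnit.find(needle).empty);

    // Range of graphemes: same questions again.
    //assert(!haystack.byGrapheme.find(needle).empty);
}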