On Wednesday, March 07, 2018 13:40:20 Nick Treleaven via Digitalmars-d wrote:
> On Wednesday, 7 March 2018 at 13:24:25 UTC, Jonathan M Davis wrote:
> > I'd actually argue that that's the lesser of the problems with
> > auto-decoding. The big problem is that it's auto-decoding. Code
> > points are almost always the wrong level to be operating at.
>
> For me the fundamental problem is having char in the language at
> all, meaning a Unicode string. Arbitrary slicing and indexing are
> not Unicode compatible; if we revisit this, we need a String type
> that doesn't support those operations. Plus the issue of string
> validation - a Unicode string type should be assumed to have
> valid contents - unsafe data should only be checked at string
> construction time, so iterating should always be nothrow.
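A minimal sketch of that validate-at-construction idea, using Phobos's existing std.utf.validate (my example, not from the post above):

```d
import std.exception : assertThrown;
import std.utf : validate, UTFException;

void main()
{
    // Well-formed input passes silently; after a one-time check like
    // this, iteration could in principle assume validity and be nothrow.
    string good = "héllo";
    validate(good); // no exception

    // 0xFF can never appear in valid UTF-8, so this must throw.
    auto bad = cast(string)[cast(char)0xFF];
    assertThrown!UTFException(validate(bad));
}
```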
In principle, char is supposed to be a UTF-8 code unit, and strings are supposed to be validated up front rather than constantly revalidated, but it's never been that way in practice. Regardless, having char arrays be sliceable is actually perfectly fine and desirable. That's exactly what you want whenever you operate on code units, and it's frequently the case that you want to be operating at the code unit level. But the programmer needs to be able to reasonably control when code units, code points, or graphemes are used, because each has its time and place.

If we had a string type, it would need to provide access to each of those levels and likely would not be directly sliceable at all, because slicing a string is essentially meaningless: in principle, a string is just an opaque piece of character data. It's only when you're dealing at the code unit, code point, or grapheme level that you actually start operating on pieces of a string, and that means that the level you're operating at needs to be defined.

Having strings be arrays of code units works quite well, because then you have efficiency by default. You then wrap the array in another range type when appropriate to get a range of code points or graphemes, or you explicitly decode when appropriate. Whereas right now, what we have is Phobos being "helpful" and constantly decoding for us, so we get needlessly inefficient code, and it's at the code point level, which is usually not the level you want to operate at. So you get neither efficiency nor correctness.

Ultimately, it really doesn't work to hide the details of Unicode and spare the programmer from thinking about code units, code points, and graphemes unless you don't care about efficiency. As such, what we really need is to cleanly give the programmer the tools to manage Unicode without the language or library assuming what the programmer wants - especially assuming an inefficient default. The language itself actually does a decent job of that.
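To make the three levels concrete, here's a minimal D sketch (mine, not from the thread) showing the same string measured as code units, code points, and graphemes, and how Phobos's range primitives auto-decode narrow strings to code points by default:

```d
import std.range : walkLength;
import std.uni : byGrapheme;
import std.utf : byCodeUnit;

void main()
{
    // "e" followed by U+0301 (combining acute accent) renders as one "é".
    string s = "cafe\u0301";

    assert(s.length == 6);                // 6 UTF-8 code units (U+0301 is 2 bytes)
    assert(s.walkLength == 5);            // auto-decoded by Phobos: 5 code points
    assert(s.byCodeUnit.walkLength == 6); // explicitly opt out of decoding
    assert(s.byGrapheme.walkLength == 4); // 4 user-perceived characters
}
```

Note that byCodeUnit also restores random access and slicing over the underlying array, which auto-decoding otherwise takes away.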
It's Phobos that dropped the ball on that one, because Andrei didn't know about graphemes and tried to make Phobos Unicode-correct by default. Instead, we got inefficient and incorrect by default.

- Jonathan M Davis