On Monday, March 28, 2016 22:34:31 Jack Stouffer via Digitalmars-d-learn wrote:
> void main () {
>     import std.range.primitives;
>     char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];
>     pragma(msg, ElementEncodingType!(typeof(val)));
>     pragma(msg, typeof(val.front));
> }
>
> prints
>
>     char
>     dchar
>
> Why?

    static assert(is(ElementType!(typeof(val)) == dchar));

The range API considers all strings to have an element type of dchar. char, wchar, and dchar are UTF code units - UTF-8, UTF-16, and UTF-32 respectively. One or more code units make up a code point, which is actually something displayable but not necessarily what you'd call a character (e.g. it could be an accent). One or more code points then make up a grapheme, which is really what a displayable character is.

When Andrei designed the range API, he didn't know about graphemes - just code units and code points, so he thought that code points were guaranteed to be full characters and decided that that's what we'd operate on for correctness' sake.

In the case of UTF-8, a code point is made up of 1 to 4 code units of 8 bits each. In the case of UTF-16, a code point is made up of 1 or 2 code units of 16 bits each. And in the case of UTF-32, a code unit is guaranteed to be a single code point. So, by having the range API decode UTF-8 and UTF-16 to UTF-32, strings then become ranges of dchar and avoid having code points chopped up by stuff like slicing.

So, while a code point is not actually guaranteed to be a full character, certain classes of bugs are prevented by operating on ranges of code points rather than code units. Of course, for full correctness, graphemes need to be taken into account, and some algorithms generally don't care whether they're operating on code units, code points, or graphemes (e.g. find on code units generally works quite well, whereas something like filter would be a complete disaster if you're not actually dealing with ASCII).

Arrays of char and wchar are termed "narrow strings" - hence isNarrowString is true for them (but not for arrays of dchar) - and the range API does not consider them to have slicing, be random access, or have length, because as ranges of dchar, those operations would be O(n) rather than O(1).

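To make the narrow string behavior concrete, here is a minimal sketch. It assumes a Phobos in which the range primitives still auto-decode strings, and the sample string is only an illustration:

```d
import std.range.primitives;
import std.traits : isNarrowString;

void main()
{
    char[] val = ['1', '0', 'h', '3', '6', 'm', '2', '8', 's'];

    // As a range, char[] yields decoded code points (dchar).
    static assert(is(ElementType!(typeof(val)) == dchar));
    static assert(is(typeof(val.front) == dchar));

    // Arrays of char and wchar are narrow strings; arrays of dchar are not.
    static assert(isNarrowString!(char[]));
    static assert(isNarrowString!(wchar[]));
    static assert(!isNarrowString!(dchar[]));

    // The range traits deny length, slicing, and random access to narrow
    // strings, even though the underlying arrays support all three.
    static assert(!hasLength!(char[]));
    static assert(!hasSlicing!(char[]));
    static assert(!isRandomAccessRange!(char[]));
    static assert(isRandomAccessRange!(dchar[]));

    // U+00E9 is one code point but two UTF-8 code units.
    string s = "\u00E9";
    assert(s.length == 2);     // array length counts code units
    assert(s.walkLength == 1); // range traversal counts code points
}
```

The last two asserts show why the range API refuses to report a length for narrow strings: the array's length and the number of elements the range produces are different numbers.
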
However, because of this mess of whether an algorithm works best when operating on code units or code points and the desire to avoid decoding to code points if unnecessary, many algorithms special-case narrow strings in order to operate on them more efficiently. So, ElementEncodingType was introduced for such cases. ElementType gives you the element type of the range, and for everything but narrow strings ElementEncodingType is the same as ElementType. In the case of narrow strings, ElementType is dchar, whereas ElementEncodingType is the actual element type of the array - hence why ElementEncodingType!(typeof(val)) is char in your code above.

The correct way to deal with this is really to understand Unicode well enough to know when you should be dealing at the code unit, code point, or grapheme level and write your code accordingly, but that's not exactly easy. So, in some respects, just operating on strings as dchar simplifies things and reduces bugs relating to breaking up code points, but it does come with an efficiency cost, and it does make the range API more confusing when it comes to operating on narrow strings. And it isn't even fully correct, because it doesn't take graphemes into account. But it's what we're stuck with at this point.

std.utf provides byCodeUnit and byChar to iterate by code unit or specific character types, and std.uni provides byGrapheme for iterating by grapheme (along with plenty of other helper functions). So, the tools to deal with ranges of characters more precisely are there, but they do require some understanding of Unicode, and they don't always interact with the rest of Phobos very well, since they're newer (e.g. std.conv.to doesn't fully work with byCodeUnit yet, even though it works with ranges of dchar just fine).

- Jonathan M Davis
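
As a rough illustration of the three levels discussed above, the following sketch counts code units, code points, and graphemes for the same string. The helper names countCodeUnits, countCodePoints, and countGraphemes are hypothetical and exist only for this example; byCodeUnit and byGrapheme are the Phobos functions mentioned in the reply.

```d
import std.range.primitives : ElementType, ElementEncodingType, walkLength;
import std.utf : byCodeUnit;
import std.uni : byGrapheme;

// Hypothetical helpers, named here only for illustration.
size_t countCodeUnits(string s)  { return s.byCodeUnit.walkLength; }
size_t countCodePoints(string s) { return s.walkLength; } // auto-decodes to dchar
size_t countGraphemes(string s)  { return s.byGrapheme.walkLength; }

void main()
{
    // For narrow strings the two traits differ...
    static assert(is(ElementType!(char[]) == dchar));
    static assert(is(ElementEncodingType!(char[]) == char));
    // ...and for everything else they agree.
    static assert(is(ElementEncodingType!(dchar[]) == ElementType!(dchar[])));
    static assert(is(ElementEncodingType!(int[]) == ElementType!(int[])));

    // 'e' followed by U+0301 COMBINING ACUTE ACCENT:
    // one displayed character, two code points, three UTF-8 code units.
    string s = "e\u0301";
    assert(countCodeUnits(s) == 3);
    assert(countCodePoints(s) == 2);
    assert(countGraphemes(s) == 1);
}
```

Which of the three counts is the right one depends on what the code is doing, which is the judgment call described in the reply.
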