On Wednesday, 29 April 2015 at 15:13:15 UTC, Jonathan M Davis wrote:
On Wednesday, 29 April 2015 at 10:02:09 UTC, Chris wrote:
This sounds like a good starting point for a transition plan.
One important thing, though, would be to do some benchmarking
with and without autodecoding, to see if it really boosts
performance in a way that would justify the transition.
Well, personally, I think that it's worth it even if the
performance is identical (and it's guaranteed to be better
without autodecoding - it's just a question of how much
better - since there's simply less work to do). Simply
operating at the code point level like we
do now is the worst of all worlds in terms of flexibility and
correctness. As long as the Unicode is normalized, operating at
the code unit level is the most efficient, and decoding is
often unnecessary for correctness, and if you need to decode,
then you really need to go up to the grapheme level in order to
be operating on the full character, meaning that operating on
code points really has the same problems as operating on code
units as far as correctness goes. So, it's less performant
without actually being correct. It just gives the illusion of
correctness.
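The three levels in play here can be demonstrated outside of D as well. This is a minimal Python sketch (the string literals are my own illustration, not from the thread) showing that a single user-perceived character can be two code points and three UTF-8 code units, and that code-point-level slicing still corrupts it:

```python
import unicodedata

# "é" written as 'e' + U+0301 COMBINING ACUTE ACCENT:
# one grapheme, two code points, three UTF-8 code units.
s = "e\u0301"

print(len(s))                  # 2 code points
print(len(s.encode("utf-8")))  # 3 UTF-8 code units

# Slicing at the code point level can still split a grapheme,
# which is exactly the correctness problem described above:
print(s[:1])                   # "e" - the accent is lost

# Normalizing to NFC collapses it to the single code point U+00E9,
# which is why "as long as the Unicode is normalized" matters:
print(len(unicodedata.normalize("NFC", s)))  # 1
```

In D terms, the middle count (code points) is what autodecoding iterates over, and only something like byGrapheme gets the full character.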
By treating strings as ranges of code units, you don't take a
performance hit when you don't need to, and it forces you to
actually consider something like byDchar or byGrapheme if you
want to operate on full, Unicode characters. It's similar to
how operating on UTF-16 code units as if they were characters
(as Java and C# generally do) frequently gives the incorrect
impression that you're handling Unicode correctly, because you
have to work harder at coming up with characters that can't fit
in a single code unit, whereas with UTF-8, anything but ASCII
is screwed if you treat code units as code points. Treating
code points as if they were full characters like we're doing
now in Phobos with ranges just makes it that much harder to
notice that you're not handling Unicode correctly.
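The UTF-16 trap can be made concrete by counting code units. A small Python sketch (using Python's codecs to stand in for Java/C# strings, which are sequences of UTF-16 code units):

```python
# U+1F600 lies outside the BMP, so in UTF-16 (the representation
# behind Java/C# strings) it takes a surrogate pair - two code
# units for a single code point.
c = "\U0001F600"
print(len(c.encode("utf-16-le")) // 2)  # 2 UTF-16 code units
print(len(c.encode("utf-8")))           # 4 UTF-8 code units

# With UTF-8 you don't have to hunt for exotic characters:
# anything past ASCII is already multiple code units.
print(len("ä".encode("utf-8")))         # 2 code units
```

Because characters that need surrogate pairs are much rarer than non-ASCII text in general, UTF-16 code that treats code units as characters can look correct far longer than the equivalent UTF-8 code - which is the illusion being described.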
Also, treating strings as ranges of code units makes it so that
they're not so special and actually are treated like every
other type of array, which eliminates a lot of the special
casing that we're forced to do right now, and it eliminates all
of the confusion that folks keep running into when string
doesn't work with many functions, because it's not a
random-access range or doesn't have length, or because the
resulting range isn't the same type (copy would be a prime
example of a function that doesn't work with char[] when it
should). By leaving in autodecoding, we're basically leaving in
technical debt in D permanently. We'll forever have to be
explaining it to folks and forever have to be working around it
in order to achieve either performance or correctness.
What we have now isn't performant, correct, or flexible, and
we'll be forever paying for that if we don't get rid of
autodecoding.
I don't criticize Andrei in the least for coming up with it,
since if you don't take graphemes into account (and he didn't
know about them at the time), it seems like a great idea and
allows us to be correct by default and performant if we put
some effort into it, but after having seen how it's worked out,
how much code has to be special-cased, how much confusion there
is over it, and how it's not actually correct anyway, I think
that it's quite clear that autodecoding was a mistake. And at
this point, it's mainly a question of how we can get rid of it
without being too disruptive and whether we can convince Andrei
that it makes sense to make the change, since he seems to still
think that autodecoding is fine in spite of the fact that it's
neither performant nor correct.
It may be that the decision will be that it's too disruptive to
remove autodecoding, but I think that that's really a question
of whether we can find a way to do it that doesn't break tons
of code rather than whether it's worth the performance or
correctness gain.
- Jonathan M Davis
Ok, I see. Well, if we don't want to repeat C++'s mistakes, we
should fix it before it's too late. Since I'm dealing a lot with
strings (non-ASCII) and depend on Unicode (and correctness!), I
would be more than happy to test any changes to Phobos with my
programs to see if it screws up anything.