On Sun, Sep 28, 2014 at 12:06:16PM +0000, Uranuz via Digitalmars-d wrote: > On Sunday, 28 September 2014 at 00:13:59 UTC, Andrei Alexandrescu wrote: > >On 9/27/14, 3:40 PM, H. S. Teoh via Digitalmars-d wrote: > >>If we can get Andrei on board, I'm all for killing off autodecoding. > > > >That's rather vague; it's unclear what would replace it. -- Andrei > > I believe that removing autodeconding will make things even worse. As > far as understand if we will remove it from front() function that > operates on narrow strings then it will return just byte of char. I > believe that proceeding on narrow string by `user perceived chars` > (graphemes) is more common use case. [...]
Unfortunately this is not what autodecoding does today. Today's autodecoding only segments strings into code *points*, which are not the same thing as graphemes. For example, combining diacritics are normally not considered separate characters from the user's POV, but they *are* separate codepoints from their base character. The only reason today's autodecoding is even remotely considered "correct" from an intuitive POV is because most Western character sets happen to use only precomposed characters rather than combining diacritic sequences. If you were processing, say, Korean text, the present autodecoding .front would *not* give you what you might imagine is a "single character"; it would only be halves of Korean graphemes. Which, from a user's POV, would suffer from the same issues as dealing with individual bytes in a UTF-8 stream -- any mistake on the program's part in handling these half-units will cause "corruption" of the text (not corruption in the same sense as an improperly segmented UTF-8 byte stream, but in the sense that the wrong glyphs will be displayed on the screen -- from the user's POV these two are basically the same thing). You might then be tempted to say, well let's make .front return graphemes instead. That will solve the "single intuitive character" issue, but the performance will be FAR worse than what it is today. So basically, what we have today is neither efficient nor complete, but a halfway solution that mostly works for Western character sets but is incomplete for others. We're paying efficiency for only a partial benefit. Is it worth the cost? I think the correct solution is not for Phobos to decide for the application at what level of abstraction a string ought to be processed. Rather, let the user decide. If they're just dealing with opaque blocks of text, decoding or segmenting by grapheme is completely unnecessary -- they should just operate on byte ranges as opaque data. They should use byCodeUnit. If they need to work with Unicode codepoints, let them use byCodePoint. If they need to work with individual user-perceived characters (i.e., graphemes), let them use byGrapheme. This is why I proposed the deprecation path of making it illegal to pass raw strings to Phobos algorithms -- the caller should specify what level of abstraction they want to work with -- byCodeUnit, byCodePoint, or byGrapheme. The standard library's job is to empower the D programmer by giving him the choice, not to shove a predetermined solution down his throat. T -- Life is unfair. Ask too much from it, and it may decide you don't deserve what you have now either.
