On 3/7/2014 6:33 PM, H. S. Teoh wrote:
> On Fri, Mar 07, 2014 at 11:13:50PM +0000, Sarath Kodali wrote:
>> On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:

>>> +1
>>> In Indian languages, a character consists of one or more Unicode
>>> code points. For example, the Sanskrit conjunct "ddhrya"
>>> http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
>>> consists of 7 Unicode code points. So to search for this character
>>> I have to use a string search.

>>> - Sarath
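
[Editorial aside, not part of the quoted post: the point above is easy to reproduce in D. In this sketch, "e" plus U+0301 (combining acute accent) stands in for the 7-code-point Sanskrit conjunct; it assumes std.uni.byGrapheme is available, as it is in any recent Phobos.]

import std.range : walkLength;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    // One user-perceived character: two code points, three UTF-8 code units.
    string s = "e\u0301";
    writeln(s.length);                // 3 code units
    writeln(s.walkLength);            // 2 code points (auto-decoded)
    writeln(s.byGrapheme.walkLength); // 1 grapheme
}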

>> Oops, incomplete reply ...

>> Since a single "alphabet" in Indian languages can contain multiple
>> code points, iterating over single code points is like iterating
>> over char[] for non-English European languages. So decoding is of
>> no use other than decreasing performance. A raw char[] comparison
>> is much faster.

> Yes. The more I think about it, the more auto-decoding sounds like the
> wrong decision. The question, though, is whether it's worth the
> massive code breakage needed to undo it. :-(


I'm leaning the same way. But I also think Andrei is right that, at this point in time, it'd be a terrible move to change things so that "by code unit" is the default. For better or worse, that ship has sailed.

Perhaps we *can* deal with the auto-decoding problem not by killing auto-decoding, but by marginalizing it in an additive way:

Convincing arguments have been made that any string-processing code which *isn't* done entirely with the official Unicode algorithms is likely wrong *regardless* of whether std.algorithm defaults to per-code-unit or per-code-point operation.
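
To make that concrete (an untested sketch): these two spellings of "é" are canonically equivalent, yet they compare unequal at both the code-unit and the code-point level. Only a genuine Unicode algorithm, normalization via std.uni.normalize here, gives the right answer:

import std.stdio : writeln;
import std.uni : NFC, normalize;

void main()
{
    string precomposed = "\u00E9";  // é as a single code point
    string decomposed = "e\u0301";  // e followed by a combining acute accent

    writeln(precomposed == decomposed);                               // false
    writeln(normalize!NFC(precomposed) == normalize!NFC(decomposed)); // true
}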

So... how's this: We add whichever of these Unicode algorithms we're still missing, encourage their use for strings, discourage use of std.algorithm for string processing, and in the meantime just do our best to reduce unnecessary decoding wherever possible. Then we call it a day and everyone's happy :)
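
For instance, a grapheme-aware search (again, just an untested sketch) would handle exactly the case Sarath raised, finding a user-perceived character even when it spans several code points:

import std.algorithm : canFind, equal;
import std.stdio : writeln;
import std.uni : byGrapheme;

void main()
{
    string text = "re\u0301sume\u0301"; // "résumé" in decomposed form

    // Look for the two-code-point grapheme: "e" + combining acute accent.
    bool found = text.byGrapheme.canFind!(g => g[].equal("e\u0301"));
    writeln(found); // true
}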
