On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:

+1
In Indian languages, a single character consists of one or more Unicode code points. For example, the Sanskrit "ddhrya" http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg consists of 7 Unicode code points. So to search for this character I have to use a string search, not a search for a single code point.
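A minimal sketch of that point in Python, assuming "ddhrya" is encoded as da, virama, dha, virama, ra, virama, ya. The names DDHRYA, find_cluster and sample are made up for illustration.

```python
# "ddhrya" written as explicit code points (assumed encoding):
# da, virama, dha, virama, ra, virama, ya
DDHRYA = "\u0926\u094d\u0927\u094d\u0930\u094d\u092f"


def find_cluster(text: str, cluster: str) -> int:
    """Return the code-point index of the first match of cluster, or -1."""
    # A substring search is needed because the "character" is a sequence
    # of code points, not a single one.
    return text.find(cluster)


# Hypothetical sample text: the letter "a" followed by the conjunct.
sample = "\u0905" + DDHRYA

print(len(DDHRYA))                   # number of code points in the conjunct
print(find_cluster(sample, DDHRYA))  # index where the conjunct starts
```

Searching for any one of the seven code points on its own would only find a fragment of the visible character.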

- Sarath

Oops, incomplete reply ...

Since a single "letter" in Indian languages can contain multiple code points, iterating over single code points is like iterating over char[] for non-English European languages: each element is only a fragment of what the reader sees as one character. So decoding gains nothing here and only costs performance. A raw char[] comparison is much faster.
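The raw comparison can be sketched in Python as a byte-level search over UTF-8 data, with no decoding step. This is only an illustration under the same assumed encoding of "ddhrya"; find_bytes, cluster and text are hypothetical names.

```python
def find_bytes(haystack: bytes, needle: bytes) -> int:
    """Return the byte offset of the first match of needle, or -1."""
    # UTF-8 is self-synchronizing: a match for a complete, valid UTF-8
    # needle cannot start in the middle of an encoded code point, so the
    # search can run on raw bytes without decoding anything.
    return haystack.find(needle)


# Assumed encoding of "ddhrya": da, virama, dha, virama, ra, virama, ya
cluster = "\u0926\u094d\u0927\u094d\u0930\u094d\u092f".encode("utf-8")

# Hypothetical sample text: the letter "a" followed by the conjunct.
text = ("\u0905" + "\u0926\u094d\u0927\u094d\u0930\u094d\u092f").encode("utf-8")

print(len(cluster))               # bytes in the encoded conjunct
print(find_bytes(text, cluster))  # byte offset of the match
```

A byte match is aligned to code points but not necessarily to whole characters: a needle holding only "da" would still match the first three bytes of the conjunct.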

And then there is Unicode normalization: the same text can be encoded as different code point sequences, which makes string searches and comparisons very difficult unless both sides are normalized first.
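A small Python sketch of the normalization problem, using the Latin "e with acute" because its two encodings are easy to show. The helper same_text is a made-up name, and NFC is just one of the normalization forms that could be chosen.

```python
import unicodedata


def same_text(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normal form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "\u00e9"  # single code point: e with acute
decomposed = "e\u0301"  # two code points: e + combining acute accent

print(precomposed == decomposed)           # raw comparison
print(same_text(precomposed, decomposed))  # comparison after normalization
```

The raw comparison treats the two spellings as different strings; they only compare equal once both are normalized, which is the extra cost a plain char[] comparison does not pay.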

- Sarath
