On Friday, 7 March 2014 at 22:35:47 UTC, Sarath Kodali wrote:
+1
In Indian languages, a single character can consist of several
Unicode code points. For example, the Sanskrit conjunct "ddhrya"
http://en.wikipedia.org/wiki/File:JanaSanskritSans_ddhrya.svg
consists of 7 Unicode code points. So to search for this character
I have to use a string search.
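
To make the count concrete, here is a small sketch in Python (not D, just for illustration). It assumes the conjunct is encoded as da, dha, ra and ya joined by three viramas; the name contains_cluster is made up for this example.

```python
# Assumed encoding of "ddhrya": da + virama + dha + virama + ra + virama + ya
DDHRYA = "\u0926\u094D\u0927\u094D\u0930\u094D\u092F"


def contains_cluster(text: str, cluster: str) -> bool:
    """Look for a multi-code-point character with a plain substring search."""
    return cluster in text


if __name__ == "__main__":
    print(len(DDHRYA))                  # number of code points
    print(len(DDHRYA.encode("utf-8")))  # number of UTF-8 code units
    print(contains_cluster("abc " + DDHRYA + " xyz", DDHRYA))
```

A search for a single code point cannot express "find this character" here; only a substring search can.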
- Sarath
Oops, incomplete reply ...
Since a single "letter" in Indian languages can span
multiple code points, iterating over single code points is like
iterating over char[] for non-English European languages: each
element is only a fragment of a character. So decoding achieves
nothing except slower code. A raw char[] comparison is much faster.
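
A rough sketch of the raw search, again in Python rather than D, with bytes standing in for char[]. The helper name find_raw and the sample text are made up for this example, and it says nothing about actual timings; it only shows that a search over the undecoded UTF-8 finds the same match as a search over decoded code points.

```python
# Assumed encoding of "ddhrya" (7 code points, 21 UTF-8 bytes)
DDHRYA = "\u0926\u094D\u0927\u094D\u0930\u094D\u092F"


def find_raw(haystack: bytes, needle: bytes) -> int:
    """Search the undecoded UTF-8 bytes; returns a byte offset or -1."""
    return haystack.find(needle)


if __name__ == "__main__":
    text = "\u0928\u092E\u0903 " + DDHRYA   # three Devanagari code points, a space, then the conjunct
    raw = text.encode("utf-8")

    byte_offset = find_raw(raw, DDHRYA.encode("utf-8"))
    code_point_offset = text.find(DDHRYA)

    # Both searches land on the same character; only the unit of the offset differs.
    print(byte_offset, code_point_offset)
    print(len(raw[:byte_offset].decode("utf-8")) == code_point_offset)
```

Because UTF-8 is self-synchronizing, a valid needle can only match at a code point boundary, so no decoding is needed for the search to be correct.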
And then there is Unicode normalization: the same text can be
encoded as different code-point sequences, which makes string
searches and comparisons very difficult.
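
A minimal Python sketch of the normalization problem, using the standard unicodedata module. The helper name nfc_equal is made up for this example, and NFC is just one reasonable choice of form.

```python
import unicodedata


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after bringing both to the same normalization form."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


if __name__ == "__main__":
    # Devanagari "qa": one precomposed code point vs. ka followed by nukta
    precomposed = "\u0958"
    decomposed = "\u0915\u093C"

    print(precomposed == decomposed)           # raw comparison sees different sequences
    print(nfc_equal(precomposed, decomposed))  # equal once both sides are normalized
```

So a raw comparison is only reliable if both strings are already known to be in the same normalization form.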
- Sarath