On Sat, 6 Jul 2002, C Bobroff wrote:

> BTW, I can understand the heh-waw reversal and the 4 extra Persian letters
> being dumped at the end, but please tell me, why is the "kaf" out of
> order?

The point is that the difference is not only 4 letters, but 6. The codes for
Persian Kaf and Yeh (called "Keheh" and "Farsi Yeh" in Unicode) are also
different from those of their Arabic counterparts, so they too end up at
the end.
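
A small Python sketch of this point, using only the standard unicodedata
module. The names PAIRS and EXTRA are made up for the illustration; the code
points are the ones assigned in the Unicode Arabic block.

```python
import unicodedata

# Arabic letters and the Persian variants that Unicode encodes separately.
PAIRS = [
    ("\u0643", "\u06A9"),  # Arabic Kaf vs. Keheh (Persian Kaf)
    ("\u064A", "\u06CC"),  # Arabic Yeh vs. Farsi Yeh
]

for arabic, persian in PAIRS:
    for ch in (arabic, persian):
        print(f"U+{ord(ch):04X} {unicodedata.name(ch)}")

# The four extra Persian letters: Peh, Tcheh, Jeh, Gaf.
EXTRA = "\u067E\u0686\u0698\u06AF"

# With plain code-point ordering, all six come after Arabic Yeh (U+064A),
# the last letter of the basic Arabic alphabet.
persian_only = [persian for _, persian in PAIRS] + list(EXTRA)
assert all(ord(ch) > 0x064A for ch in persian_only)
```

So a sort that compares raw code points will put Persian Kaf and Yeh after
every basic Arabic letter, just like the four extra letters.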

Actually, these are not the only things needed to get proper Persian
sorting. You should also think about getting Teh Marbuta sorted with Heh.
Also, all Hamza forms (Hamza, Alef With Hamza Above, Alef With Hamza
Below, Waw With Hamza Above, Yeh With Hamza Above) should sort equally at
the first level, between Alef and Beh. Only if two different words come
out equal at that level, like "mo'men" (Meem, Waw-Hamza, Meem, Noon) and
"ma'man" (Meem, Alef-Hamza, Meem, Noon), should you then consider the
difference between their Hamza forms.
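
The two-level idea can be sketched in Python roughly as follows. This is
only an illustration under assumptions, not a complete Persian collation:
ORDER, FOLD, RANK and sort_key are hypothetical names, unknown characters are
simply ranked after all letters, and the second level just falls back to raw
code points, which is an arbitrary choice for the tie-break.

```python
# Persian alphabet order, with one shared slot for Hamza after Alef.
ORDER = (
    "\u0627"                          # Alef
    "\u0621"                          # Hamza (all forms share this slot)
    "\u0628\u067E\u062A\u062B"        # Beh, Peh, Teh, Theh
    "\u062C\u0686\u062D\u062E"        # Jeem, Tcheh, Hah, Khah
    "\u062F\u0630\u0631\u0632\u0698"  # Dal, Thal, Reh, Zain, Jeh
    "\u0633\u0634\u0635\u0636"        # Seen, Sheen, Sad, Dad
    "\u0637\u0638\u0639\u063A"        # Tah, Zah, Ain, Ghain
    "\u0641\u0642\u06A9\u06AF"        # Feh, Qaf, Keheh, Gaf
    "\u0644\u0645\u0646"              # Lam, Meem, Noon
    "\u0648\u0647\u06CC"              # Waw, Heh, Farsi Yeh
)
RANK = {ch: i for i, ch in enumerate(ORDER)}

# First-level folding: variants that should compare equal to a base letter.
FOLD = {
    "\u0623": "\u0621",  # Alef With Hamza Above -> Hamza
    "\u0625": "\u0621",  # Alef With Hamza Below -> Hamza
    "\u0624": "\u0621",  # Waw With Hamza Above  -> Hamza
    "\u0626": "\u0621",  # Yeh With Hamza Above  -> Hamza
    "\u0629": "\u0647",  # Teh Marbuta           -> Heh
}

def sort_key(word):
    # Level 1: position in the Persian alphabet after folding.
    primary = tuple(RANK.get(FOLD.get(ch, ch), len(ORDER)) for ch in word)
    # Level 2 (assumed tie-break): the original code points.
    secondary = tuple(ord(ch) for ch in word)
    return (primary, secondary)

momen = "\u0645\u0624\u0645\u0646"  # Meem, Waw-Hamza, Meem, Noon
maman = "\u0645\u0623\u0645\u0646"  # Meem, Alef-Hamza, Meem, Noon

# Equal at the first level, distinguished only at the second.
assert sort_key(momen)[0] == sort_key(maman)[0]
assert sort_key(momen) != sort_key(maman)
```

Use it as sorted(words, key=sort_key). Because Python compares tuples
element by element, the second level is only consulted when the first level
is identical, which is the behaviour described above.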

It will get more complicated if you consider Fatha, Kasra, etc., and then
punctuation, but let's forget that for the moment. Proper sorting is
considered a process with at least four levels, even for English text. But
that may not be enough, and some sophisticated preprocessing is also
needed: I remember an exercise from Knuth's The Art of Computer
Programming that asks you to implement the way librarians sort: they file
"2001: A Space Odyssey" under "T", for a start.
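
A toy Python sketch of that kind of preprocessing, just to show the shape of
the problem. SPELLED_OUT and filing_key are hypothetical names, and the
one-entry table stands in for a real number-to-words routine, which this
sketch does not attempt.

```python
import re

# Hypothetical lookup table; a real catalogue would need a general
# number-to-words routine instead of a hand-written entry.
SPELLED_OUT = {"2001": "Two Thousand and One"}

def filing_key(title):
    """Return the string a title is filed under (sketch only)."""
    match = re.match(r"\d+", title)
    if match and match.group() in SPELLED_OUT:
        # Replace the leading number with its spelled-out form.
        title = SPELLED_OUT[match.group()] + title[match.end():]
    return title.casefold()

titles = ["Ulysses", "2001: A Space Odyssey", "Solaris"]
print(sorted(titles, key=filing_key))
```

With this key the title is filed between "Solaris" and "Ulysses", that is,
under "T", whereas a plain string sort would put it before both because
digits sort ahead of letters.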

roozbeh

_______________________________________________
FarsiWeb mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/farsiweb
