> Here is a solution (in fact a hack) that, if implemented correctly,
> can resolve some of the issues until people and Google start using
> correct software:
>
> With a little tweaking, web servers can translate the correct
> Unicode into the incorrect Unicode that Win9X users so much desire.
> That is, the web server looks at the browser request and, if it can
> detect Win9X, translates all U+06CC's in the document to U+064A (and
> makes all other required translations). The same technique could be
> used to fool Google into generating correct search results: the web
> server generates a Win9X-friendly version of the document and appends
> it to the original document. You can also allocate tags with which
> the user of the web server can disable or enable some of these
> features. This may even give one some advantage over other web
> hosting companies.
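For concreteness, the translation step proposed above could look roughly like the sketch below. The function names, the User-Agent pattern, and the KEHEH-to-KAF entry are my own assumptions for illustration; the proposal itself only names the U+06CC to U+064A mapping and "all other required translations".

```python
import re

# Map correct Persian code points to the ones Win9x-era clients expect.
# Only the FARSI YEH entry comes from the proposal; the KEHEH entry is
# an assumed example of an "other required translation".
WIN9X_MAP = str.maketrans({
    "\u06CC": "\u064A",  # ARABIC LETTER FARSI YEH -> ARABIC LETTER YEH
    "\u06A9": "\u0643",  # ARABIC LETTER KEHEH -> ARABIC LETTER KAF (assumed)
})

# Rough heuristic for Win9x User-Agent strings (an assumption, not a
# complete or reliable detector).
WIN9X_UA = re.compile(r"Windows 9[58]|Win9[58]|Win 9x", re.IGNORECASE)


def is_win9x(user_agent):
    """Return True if the User-Agent header looks like a Win9x client."""
    return bool(WIN9X_UA.search(user_agent or ""))


def render_for(user_agent, document):
    """Return the document, translated only for clients detected as Win9x."""
    if is_win9x(user_agent):
        return document.translate(WIN9X_MAP)
    return document
```

A server would call something like render_for() with the request's User-Agent header just before sending the response body, so the stored document keeps the correct code points.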
That solves half of the problem. On Win9x, the D key on the keyboard
inserts ARABIC YEH (U+064A), while on Win2K and later it inserts FARSI
YEH (U+06CC). So, if you use this method, a user on Win9x who types a
word containing yeh into Google's search box still won't find your
site.

The best hack (or solution, as one might call it) I've found for this
is feeding Google a version of the page that contains both forms of
each word (one using YEH and one using FARSI YEH), so that the chances
of Google finding your page for a given keyword are maximized. Of
course, certain measures must be taken to prevent bad results; for
example, the proximity of the words must be left intact. Even so, this
causes other problems, such as distorted keyword density, which cannot
be solved reliably.

The problem really must be fixed in the search engine code, and such
hacks have their own downsides. The search engine project I've been
working on <www.ariasearch.com> handles this (and the ARABIC KAF
versus KEHEH problem), among other problems in searching Persian text.

> Of course, the solution above is only a transient one, and it is up to
> people to upgrade their Win9X machines to something that is
> Unicode-compliant. It is also up to Google to program their systems
> so that they understand that U+06CC and U+064A have the same shape
> and hence should be regarded as the same for searching unless the
> user requests otherwise. This is the same as case-insensitive search,
> which is usually implemented by mapping all upper and lower case
> characters -- in documents and queries alike -- to uppercase.

Yeah, that's right. Of course, great care must be taken so that this
doesn't break Arabic search results.

-------------
Ehsan Akhgari

Farda Technology (http://www.farda-tech.com/)

List Owner: [EMAIL PROTECTED]

[ Email: [EMAIL PROTECTED] ]
[ WWW: http://www.beginthread.com/Ehsan ]

He who sees the abyss, but with eagle's eyes - he who with eagle's
talons grasps the abyss: he has courage.
-Thus Spoke Zarathustra, F. W. Nietzsche

_______________________________________________
PersianComputing mailing list
[EMAIL PROTECTED]
http://lists.sharif.edu/mailman/listinfo/persiancomputing