I'm indexing Arabic in my index, and to make it searchable I had to switch character sets (not fun). The problem lies in the weak standards surrounding Arabic character sets: between ISO 8859-6, windows-1256, and UTF-8 you can have three different representations of the exact same thing. UTF-8 stores Arabic in numeric form (the code that represents each letter), and the Lucene analyzer isn't too friendly with numbers, especially if you use a stemmer. The other two encodings are different from each other, but both come back to the same result: Lucene views them as if they were European character sets and tries to apply the same rules to them. So take care when you're indexing Arabic.

I only figured this out when I started experimenting with different Unix charset settings while encoding, because I have an Oracle DB that spits out the XML files on a Solaris OS, and then Lucene picks them up for indexing. And since my core application isn't in Java, I have to contend with two web servers: the main application (AOLserver) and the search application (Lucene on Resin).
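For what it's worth, the mismatch is easy to reproduce outside the Oracle/Lucene stack. Here is a minimal sketch (in Python rather than Java, just to keep it short) of what happens to a single Arabic letter in the three encodings, and what an analyzer ends up seeing if the bytes get decoded as a European charset:

```python
# The Arabic letter BEH (U+0628) happens to occupy the same byte, 0xC8,
# in both ISO-8859-6 and windows-1256 -- which is why the two legacy
# encodings "come back to the same results" for the plain letters.
raw = b"\xc8"

iso = raw.decode("iso-8859-6")      # correct decode
win = raw.decode("windows-1256")    # also correct for this letter
assert iso == win == "\u0628"       # both yield ARABIC LETTER BEH

# UTF-8 stores the same letter as a two-byte numeric code:
assert "\u0628".encode("utf-8") == b"\xd8\xa8"

# The failure mode: decode the legacy bytes as a European charset
# and you get a Latin letter, which European analyzer/stemmer rules
# will then happily mangle.
mojibake = raw.decode("iso-8859-1")
print(mojibake)  # 'È' -- LATIN CAPITAL LETTER E WITH GRAVE, not Arabic
```

The same round-trip check works in Java with `new String(bytes, charsetName)`; the point is just that you have to know which charset the bytes were written in before the text ever reaches the analyzer.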
When trying to figure out encoding issues, you need to convert everything to its simplest form and compare and contrast as it passes through your application.

Nader

-----Original Message-----
From: W. Eliot Kimber [mailto:[EMAIL PROTECTED]]
Sent: Friday, June 28, 2002 6:59 PM
To: Lucene Users List
Subject: Re: Internationalization - Arabic Language Support

Peter Carlson wrote:

> The biggest part that is usually changed per language is the analyzer. This
> is the part of Lucene which transforms and breaks up a string into distinct
> terms.

I have only the smallest understanding of Arabic as a language, but I have done some work to implement back-of-the-book indexing of Arabic (and other languages) for XSL/XSLT. Based on that experience, I think the main challenges in implementing an Arabic analyzer would be:

1. Understanding the stemming rules for Arabic. Our research into Arabic collation revealed that the rules for how Arabic words are formed are not nearly as simple as in English and other Western languages. At this point we haven't stepped up to trying to implement (or find an implementation of) Arabic stemming for collation: words are collated first by their roots, which are not necessarily at the start of the words, so simple lexical collation won't work for Arabic, and I'm assuming that full-text indexing by word roots would have the same problem. So I don't know more than that the problem is hard, even for native speakers of Arabic.

2. Handling different letter forms in queries. Semitic languages often have different forms for the same abstract character depending on its position in a word: initial forms, final forms, and base forms. These different forms have different Unicode code points (although initial and final forms are identified as such in the Unicode database). Often a word will be stored with the base forms, but the presented word will be transformed to use the appropriate initial or final form.
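In Unicode terms, folding those positional variants back to base forms is what compatibility normalization (NFKC) does: the Arabic Presentation Forms code points carry compatibility decompositions to the base letters. A minimal sketch using Python's standard `unicodedata` module (standing in for whatever equivalent the Java side would use):

```python
import unicodedata

# U+FE91 is ARABIC LETTER BEH INITIAL FORM, one of the Presentation
# Forms-B code points that a rendered document might hand you.
initial_beh = "\ufe91"
base_beh = "\u0628"  # ARABIC LETTER BEH, the base form

# NFKC compatibility normalization folds the positional form back to
# the base letter, so query text can match what the index stored.
assert unicodedata.normalize("NFKC", initial_beh) == base_beh

# The final form (U+FE90) folds to the same base letter:
assert unicodedata.normalize("NFKC", "\ufe90") == base_beh
```

Applying the same normalization at both index time and query time is what keeps the two sides comparable.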
This means, for example, that cutting and pasting a word from, say, a PDF document into a query might require rationalization of variant forms to base forms before performing the search (assuming that the analyzer also reduces all letters to their base forms for indexing).

3. Right-to-left entry of queries and presentation of results. Mixing right-to-left data with left-to-right data can get pretty tricky at the user interface level (it's not an issue at the data storage level, where all characters are stored in order of occurrence regardless of presentation direction). Good support for bidirectional input and presentation is hit and miss at best. For example, we could not figure out how to get Internet Explorer to correctly present mixed English and Arabic where there were lots of special characters (as opposed to simple flowed prose, which seems to work OK). I would expect Arabic-localized web browsers to handle input OK, but it might be hard to find GUI toolkits that do it well. IBM's ICU4J package, a collection of national language support utilities and libraries, might offer some solutions to this problem, but I have not yet investigated its support for Arabic and similar languages (we used it for its Thai word breaker, which would be needed to implement a Thai analyzer for Lucene).

Cheers,
Eliot
--
W. Eliot Kimber, [EMAIL PROTECTED]
Consultant, ISOGEN International
1016 La Posada Dr., Suite 240
Austin, TX 78752
Phone: 512.656.4139

--
To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
