Thank you very much, I think this will help with the issue of identifying non-Latin letters.
Nevertheless, I wonder how words containing non-Latin characters are stored in the Lucene index. I have a word containing the German umlaut 'ü' (00 FC), and the letter is stored as two strange letters (00 C3 00 BC) in the index file I looked at (_1.fdt). Why is that?

Regards,
Matthias

-----Original Message-----
From: MOYSE Gilles (Cetelem) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 October 2003 15:14
To: 'Lucene Users List'
Subject: RE: Indexing UTF-8 and lexical errors

Hi.

You should edit the StandardTokenizer.jj file. It contains all the definitions used to generate the StandardTokenizer.java class that you are certainly using. At the end of the StandardTokenizer.jj file you'll find the definition of the LETTER token, which lists all the accepted letters in Unicode. If you want a table of the different Unicode ranges, see: http://www.alanwood.net/unicode/

In the LETTER token definition in the .jj file, Unicode characters are coded as ranges (like "\u0030"-"\u0039") or as single elements (like "\u00f1"). Adding the Arabic Unicode ranges in this part may solve your problem: add a line like "\u0600"-"\u06FF", since 0600-06FF is the range for Arabic characters.

Once modified, go to the root of your Lucene installation and recompile the StandardTokenizer.jj file with:

    ant compile

It should regenerate the Java files (and even compile them, if I remember well).

Good luck,
Gilles Moyse

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 October 2003 12:07
To: [EMAIL PROTECTED]
Subject: Indexing UTF-8 and lexical errors

I am trying to index UTF-8 encoded HTML files with content in various languages with Lucene. So far I always receive a message like

    Parse Aborted: Lexical error at line 146, column 79.
    Encountered: "\u2013" (8211), after : ""

when trying to index files with Arabic words. I am aware of the fact that tokenizing/analyzing/stemming non-Latin characters has some issues, but for me tokenizing would be enough, and that should work with Arabic, Russian, etc., shouldn't it?

So, what steps do I have to take to make Lucene index non-Latin languages/characters encoded in UTF-8?

Thank you very much,
Matthias
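To make the 'ü' observation at the top of the thread concrete, here is a minimal Java sketch (the class name UmlautBytes is just for illustration). It shows that UTF-8 encodes U+00FC as the byte pair C3 BC, which would match the two "strange letters" seen in _1.fdt:

    import java.io.UnsupportedEncodingException;

    public class UmlautBytes {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // U+00FC ('ü') is a single char in Java's internal Unicode strings,
            // but UTF-8 encodes every code point above 0x7F as two or more bytes.
            byte[] utf8 = "\u00fc".getBytes("UTF-8");
            for (int i = 0; i < utf8.length; i++) {
                // prints "c3 bc" -- the byte pair observed in the .fdt file
                System.out.print(Integer.toHexString(utf8[i] & 0xff) + " ");
            }
            System.out.println();
        }
    }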
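A sketch of what the LETTER edit Gilles describes might look like in StandardTokenizer.jj. The layout and the existing entries are illustrative and from memory, not copied from any particular Lucene release, so check your own copy of the file:

    | < #LETTER:                 // characters the tokenizer accepts as letters
          [
           "\u0041"-"\u005a",    // A-Z, a range
           "\u0061"-"\u007a",    // a-z
           "\u00f1",             // a single element, as in the example above
           // ...the rest of the existing ranges...
           "\u0600"-"\u06ff"     // added: the Arabic range
          ]
      >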
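For the original question at the bottom of the thread, a minimal indexing sketch against the Lucene 1.x-era API, from memory (the class name IndexUtf8 and the "index" path are placeholders). Constructing the Reader with an explicit UTF-8 charset handles the encoding half of the problem; the tokenizer change above handles the lexical errors:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexUtf8 {
        public static void main(String[] args) throws IOException {
            // 'true' creates a new index in the "index" directory
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

            // Declare the encoding explicitly; a plain FileReader would use the
            // platform default encoding and mangle the UTF-8 input.
            Reader reader = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");

            Document doc = new Document();
            doc.add(Field.Text("contents", reader)); // tokenized and indexed, not stored
            writer.addDocument(doc);
            writer.close();
        }
    }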
