I am trying to index UTF-8 encoded HTML files, with content in various languages, with Lucene. Whenever I try to index files containing Arabic words, I get the message:

    Parse Aborted: Lexical error at line 146, column 79.  Encountered: "\u2013" (8211), after : ""

I am aware that tokenizing/analyzing/stemming non-Latin text has its issues, but for me plain tokenizing would be enough, and that should work with Arabic, Russian, etc., shouldn't it?

So, what steps do I have to take to get Lucene to index non-Latin languages/characters encoded in UTF-8?

Thank you very much,
Matthias
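In case it helps to pin down whether this is an encoding problem rather than a Lucene problem: here is a minimal, self-contained sketch (class and method names are illustrative, not my actual indexing code) showing that reading the file through an explicit UTF-8 InputStreamReader preserves a character such as \u2013, whereas relying on the platform default encoding (e.g. a plain FileReader) might not:

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

public class Utf8ReadSketch {

    // Read a file's contents as UTF-8 text, independent of the platform
    // default encoding. A FileReader without an explicit charset uses the
    // platform default and can mangle characters such as \u2013.
    static String readUtf8(File f) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader r = new InputStreamReader(
                new FileInputStream(f), StandardCharsets.UTF_8)) {
            int c;
            while ((c = r.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a small file containing an en dash (\u2013) as UTF-8 bytes.
        File f = File.createTempFile("utf8sketch", ".html");
        f.deleteOnExit();
        try (OutputStream out = new FileOutputStream(f)) {
            out.write("dash: \u2013".getBytes(StandardCharsets.UTF_8));
        }

        // The en dash survives the round trip when UTF-8 is used explicitly.
        String text = readUtf8(f);
        System.out.println(text.equals("dash: \u2013")); // prints "true"
    }
}
```

Is reading the files this way and handing the resulting text to the IndexWriter the right direction, or does the problem lie elsewhere?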
