Indexing UTF-8 and lexical errors

Matthias Krueger Tue, 14 Oct 2003 03:05:54 -0700

I am trying to use Lucene to index UTF-8 encoded HTML files with content
in various languages. So far I always receive the message


"Parse Aborted: Lexical error at line 146, column 79.
Encountered: "\u2013" (8211), after : "" "

when trying to index files with Arabic words. I am aware that
tokenizing/analyzing/stemming non-Latin characters has some issues,
but for me tokenizing alone would be enough. And that should work with
Arabic, Russian etc., shouldn't it?

So, what steps do I have to take to make Lucene index non-Latin
languages/characters encoded in UTF-8?
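
For context, here is a minimal sketch of the decoding step I assume
matters, using only the JDK and no Lucene calls: the bytes are decoded
explicitly as UTF-8 rather than with the platform default charset. The
class and method names (Utf8Check, decodeUtf8) are hypothetical, and I
have not confirmed that decoding is where the lexical error comes from.
The sample bytes E2 80 93 are the UTF-8 encoding of the character named
in the message, "\u2013" (8211).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class Utf8Check {
    // Decode a byte stream explicitly as UTF-8 instead of relying on
    // the platform default charset. (Hypothetical helper, not Lucene API.)
    public static String decodeUtf8(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_8)) {
            char[] buf = new char[4096];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // UTF-8 byte sequence for U+2013, the character from the error message.
        byte[] bytes = {(byte) 0xE2, (byte) 0x80, (byte) 0x93};
        String text = decodeUtf8(new ByteArrayInputStream(bytes));
        // Report how many chars were decoded and the first code point.
        System.out.println(text.length() + " char(s), code point " + text.codePointAt(0));
    }
}
```

If the three bytes decode to a single character with code point 8211,
the file content itself is being read correctly as UTF-8, and the
problem would then more likely lie in the HTML parsing step than in the
encoding.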

Thank you very much,
Matthias


