Thank you very much, I think this will help with the issue of identifying non-Latin letters.
Nevertheless, I wonder how words containing non-Latin characters are stored in the Lucene index. I have a word containing the German umlaut 'ü' (00 FC), and the letter is stored as two strange letters (00 C3 00 BC) in the index file I looked at (_1.fdt). Why is that?

Regards,
Matthias

-----Original Message-----
From: MOYSE Gilles (Cetelem) [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 October 2003 15:14
To: 'Lucene Users List'
Subject: RE: Indexing UTF-8 and lexical errors

Hi.

You should edit the StandardTokenizer.jj file. It contains all the definitions used to generate the StandardTokenizer.java class that you are certainly using. At the end of the StandardTokenizer.jj file you'll find the definition of the LETTER token, which lists all the accepted letters in Unicode. If you want a table of the different Unicode ranges, see: http://www.alanwood.net/unicode/

In the LETTER token definition in the .jj file, Unicode characters are coded as ranges (like "\u0030"-"\u0039") or as single elements (like "\u00f1"). Adding the Arabic Unicode ranges in this part may solve your problem: add a line like "\u0600"-"\u06FF", since 0600-06FF is the range for Arabic characters.

Once modified, go to the root of your Lucene installation and recompile the StandardTokenizer.jj file with:

    ant compile

It should regenerate the Java files (and even compile them, if I remember well).

Good luck,
Gilles Moyse

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]
Sent: Tuesday, 14 October 2003 12:07
To: [EMAIL PROTECTED]
Subject: Indexing UTF-8 and lexical errors

I am trying to index UTF-8 encoded HTML files with content in various languages with Lucene. So far I always receive a message like

    Parse Aborted: Lexical error at line 146, column 79.
    Encountered: "\u2013" (8211), after : ""

when trying to index files with Arabic words. I am aware of the fact that tokenizing/analyzing/stemming non-Latin characters has some issues, but for me tokenizing would be enough, and that should work with Arabic, Russian, etc., shouldn't it?

So, what steps do I have to take to make Lucene index non-Latin languages/characters encoded in UTF-8?

Thank you very much,
Matthias
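To make the 'ü' observation at the top of the thread concrete, here is a minimal Java sketch (the class name UmlautBytes is just for illustration). It shows that UTF-8 encodes U+00FC as the byte pair C3 BC, which would match the two "strange letters" seen in _1.fdt:

    import java.io.UnsupportedEncodingException;

    public class UmlautBytes {
        public static void main(String[] args) throws UnsupportedEncodingException {
            // U+00FC ('ü') is a single char in Java's internal Unicode strings,
            // but UTF-8 encodes every code point above 0x7F as two or more bytes.
            byte[] utf8 = "\u00fc".getBytes("UTF-8");
            for (int i = 0; i < utf8.length; i++) {
                // prints "c3 bc" -- the byte pair observed in the .fdt file
                System.out.print(Integer.toHexString(utf8[i] & 0xff) + " ");
            }
            System.out.println();
        }
    }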
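A sketch of what the LETTER edit Gilles describes might look like in StandardTokenizer.jj. The layout and the existing entries are illustrative and from memory, not copied from any particular Lucene release, so check your own copy of the file:

    | < #LETTER:                 // characters the tokenizer accepts as letters
          [
           "\u0041"-"\u005a",    // A-Z, a range
           "\u0061"-"\u007a",    // a-z
           "\u00f1",             // a single element, as in the example above
           // ...the rest of the existing ranges...
           "\u0600"-"\u06ff"     // added: the Arabic range
          ]
      >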
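For the original question at the bottom of the thread, a minimal indexing sketch against the Lucene 1.x-era API, from memory (the class name IndexUtf8 and the "index" path are placeholders). Constructing the Reader with an explicit UTF-8 charset handles the encoding half of the problem; the tokenizer change above handles the lexical errors:

    import java.io.FileInputStream;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.io.Reader;
    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.index.IndexWriter;

    public class IndexUtf8 {
        public static void main(String[] args) throws IOException {
            // 'true' creates a new index in the "index" directory
            IndexWriter writer = new IndexWriter("index", new StandardAnalyzer(), true);

            // Declare the encoding explicitly; a plain FileReader would use the
            // platform default encoding and mangle the UTF-8 input.
            Reader reader = new InputStreamReader(new FileInputStream(args[0]), "UTF-8");

            Document doc = new Document();
            doc.add(Field.Text("contents", reader)); // tokenized and indexed, not stored
            writer.addDocument(doc);
            writer.close();
        }
    }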
