RE: Indexing and searching non-latin languages using utf-8

Eric Isakson Tue, 18 Mar 2003 09:18:11 -0800

There are a bunch of other issues... I should have qualified that. There really aren't 
any issues with the Lucene core to support Japanese, just other issues in my app that 
uses Lucene and working with my content providers to ensure consistent use of 
encodings, etc.

I have found what I think is a bug in the CJKTokenizer in that it emits an empty 
string token after processing my japanese characters. I haven't found the bug in 
CJKTokenizer yet, but as a workaround I'm using a StopFilter that removes it.

-----Original Message-----
From: Eric Isakson [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2003 11:52 AM
To: Lucene Users List
Subject: RE: Indexing and searching non-latin languages using utf-8

Have you verified that your form inputs are getting to your query objects without the 
String being mangled due to encoding problems?

I'm getting japanese in UTF-8 and use the technique described at 
http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the 
browser to Lucene. I build my index using the HTMLParser in the lucene demos and give 
them a Reader object that was created from an InputStreamReader that specifies the 
HTML file encodings (Shift_jis in my case).

There are a bunch of other issues I'm working on to support Japanese, but I'm getting 
search results at this point.

The two places that encodings should come into play for you are parsing your source 
content into the Reader or String that you use to create 
org.apache.lucene.document.Field objects and getting the user query from their browser 
to the Query objects.

Eric
--
Eric D. Isakson        SAS Institute Inc.
Application Developer  SAS Campus Drive
XML Technologies       Cary, NC 27513
(919) 531-3639         http://www.sas.com

-----Original Message-----
From: MERCIER ALEXANDRE [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, March 18, 2003 11:36 AM
To: [EMAIL PROTECTED]
Subject: Indexing and searching non-latin languages using utf-8

Hi all,

I've a matter with indexing then searching docs written in non-latin languages and 
encoded in utf-8 (Russian, by example).

I have a web application, with a simple form to search in the contents of the docs. 
When I submit the form, I encode the query term in utf-8 with
encodeURI(String) but I match no doc. I think that is due to a bad indexing but I'm 
not sure.

Lucene is normally indexing docs in writing Terms in the 'xxx.tis' file, encoding it 
in utf-8, I believe. So when it reads the file, it correctly gets russian characters 
(2 bytes) but when writing them in the index, they seem different (I've listed the 
terms in my application console).

If someone has a solution to resolve my problem, all advices are welcome.

Thanks.
Alex

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

RE: Indexing and searching non-latin languages using utf-8

Reply via email to