There are a bunch of other issues... I should have qualified that. There really aren't any issues with the Lucene core to support Japanese, just other issues in my app that uses Lucene and working with my content providers to ensure consistent use of encodings, etc.
I have found what I think is a bug in the CJKTokenizer in that it emits an empty string token after processing my japanese characters. I haven't found the bug in CJKTokenizer yet, but as a workaround I'm using a StopFilter that removes it. -----Original Message----- From: Eric Isakson [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 18, 2003 11:52 AM To: Lucene Users List Subject: RE: Indexing and searching non-latin languages using utf-8 Have you verified that your form inputs are getting to your query objects without the String being mangled due to encoding problems? I'm getting japanese in UTF-8 and use the technique described at http://w6.metronet.com/~wjm/tomcat/2001/Aug/msg00230.html to get the data from the browser to Lucene. I build my index using the HTMLParser in the lucene demos and give them a Reader object that was created from an InputStreamReader that specifies the HTML file encodings (Shift_jis in my case). There are a bunch of other issues I'm working on to support Japanese, but I'm getting search results at this point. The two places that encodings should come into play for you are parsing your source content into the Reader or String that you use to create org.apache.lucene.document.Field objects and getting the user query from their browser to the Query objects. Eric -- Eric D. Isakson SAS Institute Inc. Application Developer SAS Campus Drive XML Technologies Cary, NC 27513 (919) 531-3639 http://www.sas.com -----Original Message----- From: MERCIER ALEXANDRE [mailto:[EMAIL PROTECTED] Sent: Tuesday, March 18, 2003 11:36 AM To: [EMAIL PROTECTED] Subject: Indexing and searching non-latin languages using utf-8 Hi all, I've a matter with indexing then searching docs written in non-latin languages and encoded in utf-8 (Russian, by example). I have a web application, with a simple form to search in the contents of the docs. When I submit the form, I encode the query term in utf-8 with encodeURI(String) but I match no doc. I think that is due to a bad indexing but I'm not sure. Lucene is normally indexing docs in writing Terms in the 'xxx.tis' file, encoding it in utf-8, I believe. So when it reads the file, it correctly gets russian characters (2 bytes) but when writing them in the index, they seem different (I've listed the terms in my application console). If someone has a solution to resolve my problem, all advices are welcome. Thanks. Alex --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED] --------------------------------------------------------------------- To unsubscribe, e-mail: [EMAIL PROTECTED] For additional commands, e-mail: [EMAIL PROTECTED]
