FW: CJK support in lucene

Eric Isakson Wed, 16 Jul 2003 11:03:55 -0700


-----Original Message-----
From: Eric Isakson 
Sent: Wednesday, July 16, 2003 2:04 PM
To: 'Avnish Midha'
Subject: RE: CJK support in lucene



I'm no linguist, so the short answer is, I'm not sure about Taiwanese. If they share 
the same character sets and a bigram indexing approach makes sense for that language 
(read the links in the CJKTokenizer source), then it would probably work.

For Latin-1 languages, it will tokenize (It is setup to deal with mixed language 
documents where some of the text might be Chinese and some might be English) but it 
will be far less efficient than the standard tokenizer supplied with the Lucene core. 
But you should run your own tests to see if that would be livable.

Eric

-----Original Message-----
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:50 PM
To: Eric Isakson
Cc: Lucene Users List
Subject: RE: CJK support in lucene



Eric,

Does this tokenizer also support Taiwanese & European languages (Latin-1)?

Regards,
Avnish

-----Original Message-----
From: Eric Isakson [mailto:[EMAIL PROTECTED]
Sent: Wednesday, July 16, 2003 10:38 AM
To: Avnish Midha
Cc: Lucene Users List
Subject: RE: CJK support in lucene


This archived message has the CJKTokenizer code attached (there are some links in the 
code to material that describes the tokenization strategy).

http://nagoya.apache.org/eyebrowse/[EMAIL PROTECTED]
e.org&msgId=330905

You have to write your own analyzer that uses this tokenizer. See 
http://www.onjava.com/pub/a/onjava/2003/01/15/lucene.html for some details on how to 
write an analyzer.

here is one you could use:
package my.package;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.StopFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.cjk.CJKTokenizer;
import java.io.Reader;

public class CJKAnalyzer extends Analyzer {
    
    public CJKAnalyzer() {
    }
    
    /**
     * Creates a TokenStream which tokenizes all the text in the provided Reader.
     *
     * @return  A TokenStream built from a CJKTokenizer
     */
    public TokenStream tokenStream( String fieldName, Reader reader )
    {
        TokenStream result = new CJKTokenizer( reader );
                result = new StopFilter(result, new String[] {""}); // CJKTokenizer 
emitts a "" sometimes, haven't been able to figure it out, so this is a workaround
        return result;
    }
}

Lastly, you have to package those things up and use them along with the core lucene 
code.

CC'ing this to Lucene User so everyone can benefit from these answers. Maybe a faq on 
indexing CJK languages would be a good thing to add. The existing one 
(http://lucene.sourceforge.net/cgi-bin/faq/faqmanager.cgi?file=chapter.index
ing&toc=faq#q28) is somewhat light on details (so is this answer, but it is a bit more 
direct about dealing with CJK) and http://www.jguru.com/faq/view.jsp?EID=1011118 is 
useful to be aware of too.

Good luck,
Eric

-----Original Message-----
From: Avnish Midha [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, July 16, 2003 1:06 PM
To: Eric Isakson
Subject: CJK support in lucene



Hi Eric,

I read the description of the bug (#18933) reported by you on the apache site. I had a 
question related to this defect. In the description you have mentioned that CJK 
support should be included in the core build. Is there any other way we can enable the 
CJK support in the lucene search engine? Would be grateful to you if you could let me 
know of any such method of enabling CJK support in the serach engine.

Eagerly waiting for your reply.

Thanks & Regards,
Avnish Midha
Phone no.: +1-949-8852540




---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]

FW: CJK support in lucene

Reply via email to