First of all, I really appreciate your work on Lucene for Korean words, But If we cannot support stem analyzer for Korean words, I think one token for one Korean character is better.
When we search a word, usually we use "검색" not "검색하다". ("하다" is like "ed" of "searched"). If we cannot get any result from "검색", StandardAnalyzer has no meaning to Korean, I may have to go back to use CJKAnalyzer. How about let the StandarAnalyzer be not changed, and add a new Analyzer for Korea words? Thanks. 2005/11/8, Cheolgoo Kang <[EMAIL PROTECTED]>: > On 11/8/05, Youngho Cho <[EMAIL PROTECTED]> wrote: > > Hello, > > > > just simple test ... > > If I compile the javacc correctly.. > > the patched version doesn't match some situation > > for example > > in text > > '엔진박지성(맨체스터 유나이티드)이 주말 프리미어리그를 위해 벤치를 지키며 재충전의 시간을 가졌다.' > > if query word is '시간' than nothing match > > but if query word is '시간을' than good match. > > That's exactly what I've wanted to do with this patch. You need more > sophisticated morphological analyzer to to what you've wanted to. > AFAIK, I'm afraid there is no open sourced software to do the Korean > morphological analysis. > > And also, if you have Lucene in Action Korean translation, there is a > simple introduction to Korean stemming analysis at Appendix D. That's > just simple enough to hold a list of Korean word endings, and check > for each Token with matching word endings. :) > > > > > I think there is some tradeoff here. > > > > Maybe need some good stop filter for korean etc....... > > > > > > Thanks > > > > Youngho > > > > ----- Original Message ----- > > From: "Youngho Cho" <[EMAIL PROTECTED]> > > To: <java-user@lucene.apache.org> > > Sent: Tuesday, November 08, 2005 4:44 PM > > Subject: Re: korean and lucene > > > > > > > Hello Cheolgoo, > > > > > > I will test the patch. > > > > > > > > > Thanks, > > > > > > Youngho > > > > > > ----- Original Message ----- > > > From: "Cheolgoo Kang" <[EMAIL PROTECTED]> > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <[EMAIL PROTECTED]> > > > Sent: Tuesday, November 08, 2005 4:06 PM > > > Subject: Re: korean and lucene > > > > > > > > > > Hello, > > > > > > > > I've created a new JIRA issue with Korean analysis that > > > > StandardAnalyzer splits one word into several tokens each with one > > > > character. Cause Korean is not a phonogram, one character in Korean > > > > has almost no meaning at all. So word in Korean should be preserved > > > > not like Chinese or Japanese. > > > > > > > > I've attached a patch to StandardTokenizer to do this and passed the > > > > test case TestStandardAnalyzer(also patched to test Korean words). > > > > > > > > Thanks! > > > > > > > > On 10/27/05, Youngho Cho <[EMAIL PROTECTED]> wrote: > > > > > Hello, > > > > > > > > > > Ok , I've attached my test code for Korean which is slitely modified > > > > > Koji's code. > > > > > > > > > > Just put into the lia.analysis.i18n package at LuceneInAction > > > > > and run ant. > > > > > > > > > > Hopely someone is helped. > > > > > > > > > > -------- build.xml --------- > > > > > > > > > > <target name="JapaneseDemo" depends="prepare" > > > > > description="Examples of Jananese analysis"> > > > > > <info> > > > > > > > > > > Japanese Test... > > > > > > > > > > </info> > > > > > > > > > > <run-main class="lia.analysis.i18n.JapaneseDemo"/> > > > > > </target> > > > > > > > > > > <target name="KoreanDemo" depends="prepare" > > > > > description="Examples of Korean analysis"> > > > > > <info> > > > > > > > > > > Korean Test... > > > > > > > > > > </info> > > > > > > > > > > <run-main class="lia.analysis.i18n.KoreanDemo"/> > > > > > </target> > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > Youngho > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > From: "Youngho Cho" <[EMAIL PROTECTED]> > > > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <[EMAIL PROTECTED]> > > > > > Sent: Thursday, October 27, 2005 12:47 PM > > > > > Subject: Re: korean and lucene > > > > > > > > > > > > > > > > Hello all > > > > > > Plese forgive me pervious my stupid message > > > > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > [java] [경] [기] analyzer = > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > > > [java] phrase = 경기 > > > > > > [java] query = "경 기" > > > > > > > > > > > > I got the good result. > > > > > > > > > > > > When I compile I just rename old version lucene-1.4.3.jar to > > > > > > lucene-1.4.3.jar_bak > > > > > > and all new 1.9 lucene. and build the test package. > > > > > > After I remove lucene-1.4.3.jar_bak in lib directory completely > > > > > > I got the expected result !!!. > > > > > > > > > > > > I don't know the reason... ( looks like my finger make some > > > > > > trouble... ) > > > > > > > > > > > > Anyway thanks Koji and Cheolgoo > > > > > > I will further test now... > > > > > > > > > > > > Youngho > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > From: "Youngho Cho" <[EMAIL PROTECTED]> > > > > > > To: <java-user@lucene.apache.org> > > > > > > Sent: Thursday, October 27, 2005 12:28 PM > > > > > > Subject: Re: korean and lucene > > > > > > > > > > > > > > > > > > > Hello Koji > > > > > > > > > > > > > > Here is test result. > > > > > > > Japanese is OK !. > > > > > > > maybe ant clean did some effect. > > > > > > > > > > > > > > Anyway please refer to the result using 1.9 > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.JapaneseDemo... > > > > > > > [java] [ラ] [メ] [ン] [屋] analyzer = > > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > > > > [java] phrase = ラ?メン屋 > > > > > > > [java] query = content:ラ?メン屋 > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > [java] analyzer = > > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > > > > [java] phrase = 경 > > > > > > > [java] query = > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.JapaneseDemo... > > > > > > > [java] [ラ] [メン] [ン屋] analyzer = > > > > > > > org.apache.lucene.analysis.cjk.CJKAnalyzer > > > > > > > [java] phrase = ラ?メン屋 > > > > > > > [java] query = content:ラ?メン屋 > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > [java] [경] analyzer = > > > > > > > org.apache.lucene.analysis.cjk.CJKAnalyzer > > > > > > > [java] phrase = 경 > > > > > > > [java] query = 경 > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > [java] analyzer = > > > > > > > org.apache.lucene.analysis.standard.StandardAnalyzer > > > > > > > [java] phrase = 경기 > > > > > > > [java] query = > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > [java] [경기] analyzer = > > > > > > > org.apache.lucene.analysis.cjk.CJKAnalyzer > > > > > > > [java] phrase = 경기 > > > > > > > [java] query = 경기 > > > > > > > > > > > > > > > > > > > > > Standard analyzer didn't tokenized the Korean Character at all.... > > > > > > > > > > > > > > Ug.... look like > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444 > > > > > > > didn't effect at all for Korean. > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > Youngho > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > From: "Koji Sekiguchi" <[EMAIL PROTECTED]> > > > > > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <[EMAIL > > > > > > > PROTECTED]> > > > > > > > Sent: Thursday, October 27, 2005 11:47 AM > > > > > > > Subject: RE: korean and lucene > > > > > > > > > > > > > > > > > > > > > > Hello Youngho, > > > > > > > > > > > > > > > > I don't understand why you couldn't get hits result in Japanese, > > > > > > > > though, you had better check why the query was empty with > > > > > > > > Korean data: > > > > > > > > > > > > > > > > > For Korean > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > > > [java] phrase = 경 > > > > > > > > > [java] query = > > > > > > > > > > > > > > > > The last line should be query = 경 > > > > > > > > to get hits result. Can you check why StandardAnalyzer > > > > > > > > removes "경" during tokenizing? > > > > > > > > > > > > > > > > Koji > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > From: Youngho Cho [mailto:[EMAIL PROTECTED] > > > > > > > > > Sent: Thursday, October 27, 2005 11:37 AM > > > > > > > > > To: java-user@lucene.apache.org > > > > > > > > > Subject: Re: korean and lucene > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello Koji, > > > > > > > > > > > > > > > > > > Thanks for your kind reply. > > > > > > > > > > > > > > > > > > Yes, I used QueryParser. normaly I used > > > > > > > > > Query = QueryParser.parse( ) method. > > > > > > > > > > > > > > > > > > I put your sample code into lia.analysis.i18n package in > > > > > > > > > LuceneAction > > > > > > > > > and run JapaneseDemo using 1.4 and 1.9 > > > > > > > > > > > > > > > > > > results are > > > > > > > > > > > > > > > > > > [echo] Running lia.analysis.i18n.JapaneseDemo... > > > > > > > > > [java] query = content:ラ?メン屋 > > > > > > > > > > > > > > > > > > I can't get hits result. > > > > > > > > > > > > > > > > > > For Korean > > > > > > > > > [echo] Running lia.analysis.i18n.KoreanDemo... > > > > > > > > > [java] phrase = 경 > > > > > > > > > [java] query = > > > > > > > > > > > > > > > > > > I can't get query parse result. > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > Youngho > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > From: "Koji Sekiguchi" <[EMAIL PROTECTED]> > > > > > > > > > To: <java-user@lucene.apache.org>; "Youngho Cho" <[EMAIL > > > > > > > > > PROTECTED]> > > > > > > > > > Sent: Thursday, October 27, 2005 9:48 AM > > > > > > > > > Subject: RE: korean and lucene > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hi Youngho, > > > > > > > > > > > > > > > > > > > > With regard to Japanese, using StandardAnalyzer, > > > > > > > > > > I can search a word/phase. > > > > > > > > > > > > > > > > > > > > Did you use QueryParser? StandardAnalyzer tokenizes > > > > > > > > > > CJK characters into a stream of single character. > > > > > > > > > > Use QueryParser to get a PhraseQuery and search the query. > > > > > > > > > > > > > > > > > > > > Please see the following sample code. Replace Japanese > > > > > > > > > > "contents" and (search target) "phrase" with Korean in the > > > > > > > > > program and run. > > > > > > > > > > > > > > > > > > > > regards, > > > > > > > > > > > > > > > > > > > > Koji > > > > > > > > > > > > > > > > > > > > ============================================= > > > > > > > > > > import java.io.IOException; > > > > > > > > > > import org.apache.lucene.analysis.Analyzer; > > > > > > > > > > import org.apache.lucene.analysis.standard.StandardAnalyzer; > > > > > > > > > > import org.apache.lucene.analysis.cjk.CJKAnalyzer; > > > > > > > > > > import org.apache.lucene.store.Directory; > > > > > > > > > > import org.apache.lucene.store.RAMDirectory; > > > > > > > > > > import org.apache.lucene.index.IndexWriter; > > > > > > > > > > import org.apache.lucene.document.Document; > > > > > > > > > > import org.apache.lucene.document.Field; > > > > > > > > > > import org.apache.lucene.search.IndexSearcher; > > > > > > > > > > import org.apache.lucene.search.Hits; > > > > > > > > > > import org.apache.lucene.search.Query; > > > > > > > > > > import org.apache.lucene.queryParser.QueryParser; > > > > > > > > > > import org.apache.lucene.queryParser.ParseException; > > > > > > > > > > > > > > > > > > > > public class JapaneseByStandardAnalyzer { > > > > > > > > > > > > > > > > > > > > private static final String FIELD_CONTENT = "content"; > > > > > > > > > > private static final String[] contents = { > > > > > > > > > > "東京にはおいしいラーメン屋がたくさんあります。", > > > > > > > > > > "北海道にもおいしいラーメン屋があります。" > > > > > > > > > > }; > > > > > > > > > > private static final String phrase = "ラーメン屋"; > > > > > > > > > > //private static final String phrase = "屋"; > > > > > > > > > > private static Analyzer analyzer = null; > > > > > > > > > > > > > > > > > > > > public static void main( String[] args ) throws > > > > > > > > > IOException, ParseException { > > > > > > > > > > Directory directory = makeIndex(); > > > > > > > > > > search( directory ); > > > > > > > > > > directory.close(); > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > private static Analyzer getAnalyzer(){ > > > > > > > > > > if( analyzer == null ){ > > > > > > > > > > analyzer = new StandardAnalyzer(); > > > > > > > > > > //analyzer = new CJKAnalyzer(); > > > > > > > > > > } > > > > > > > > > > return analyzer; > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > private static Directory makeIndex() throws IOException > > > > > > > > > > { > > > > > > > > > > Directory directory = new RAMDirectory(); > > > > > > > > > > IndexWriter writer = new IndexWriter( directory, > > > > > > > > > > getAnalyzer(), true ); > > > > > > > > > > for( int i = 0; i < contents.length; i++ ){ > > > > > > > > > > Document doc = new Document(); > > > > > > > > > > doc.add( new Field( FIELD_CONTENT, contents[i], > > > > > > > > > Field.Store.YES, Field.Index.TOKENIZED ) ); > > > > > > > > > > writer.addDocument( doc ); > > > > > > > > > > } > > > > > > > > > > writer.close(); > > > > > > > > > > return directory; > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > private static void search( Directory directory ) throws > > > > > > > > > IOException, ParseException { > > > > > > > > > > IndexSearcher searcher = new IndexSearcher( directory ); > > > > > > > > > > QueryParser parser = new QueryParser( FIELD_CONTENT, > > > > > > > > > > getAnalyzer() ); > > > > > > > > > > Query query = parser.parse( phrase ); > > > > > > > > > > System.out.println( "query = " + query ); > > > > > > > > > > Hits hits = searcher.search( query ); > > > > > > > > > > for( int i = 0; i < hits.length(); i++ ) > > > > > > > > > > System.out.println( "doc = " + hits.doc( i ).get( > > > > > > > > > > FIELD_CONTENT ) ); > > > > > > > > > > searcher.close(); > > > > > > > > > > } > > > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -----Original Message----- > > > > > > > > > > > From: Youngho Cho [mailto:[EMAIL PROTECTED] > > > > > > > > > > > Sent: Thursday, October 27, 2005 8:18 AM > > > > > > > > > > > To: java-user@lucene.apache.org; Cheolgoo Kang > > > > > > > > > > > Subject: Re: korean and lucene > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Hello Cheolgoo, > > > > > > > > > > > > > > > > > > > > > > Now I updated my lucene version to 1.9 for using > > > > > > > > > > > StandardAnalyzer > > > > > > > > > > > for Korean. > > > > > > > > > > > And tested your patch which is already adopted in 1.9 > > > > > > > > > > > > > > > > > > > > > > http://issues.apache.org/jira/browse/LUCENE-444 > > > > > > > > > > > > > > > > > > > > > > But Still I have no good results with Korean compare with > > > > > > > > > CJKAnalyzer. > > > > > > > > > > > > > > > > > > > > > > Single character is good match but more two character word > > > > > > > > > > > doesn't match at all. > > > > > > > > > > > > > > > > > > > > > > Am I something missing or still there need some more > > > > > > > > > > > works ? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > Youngho. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > ----- Original Message ----- > > > > > > > > > > > From: "Cheolgoo Kang" <[EMAIL PROTECTED]> > > > > > > > > > > > To: <java-user@lucene.apache.org>; "John Wang" <[EMAIL > > > > > > > > > > > PROTECTED]> > > > > > > > > > > > Sent: Tuesday, October 04, 2005 10:11 AM > > > > > > > > > > > Subject: Re: korean and lucene > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > StandardAnalyzer's JavaCC based StandardTokenizer.jj > > > > > > > > > > > > cannot read > > > > > > > > > > > > Korean part of Unicode character blocks. > > > > > > > > > > > > > > > > > > > > > > > > You should 1) use CJKAnalyzer or 2) add Korean character > > > > > > > > > > > > block(0xAC00~0xD7AF) to the CJK token definition on the > > > > > > > > > > > > StandardTokenizer.jj file. > > > > > > > > > > > > > > > > > > > > > > > > Hope it helps. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On 10/4/05, John Wang <[EMAIL PROTECTED]> wrote: > > > > > > > > > > > > > Hi: > > > > > > > > > > > > > > > > > > > > > > > > > > We are running into problems with searching on korean > > > > > > > > > > > documents. We are > > > > > > > > > > > > > using the StandardAnalyzer and everything works with > > > > > > > > > > > > > Chinese > > > > > > > > > > > and Japanese. > > > > > > > > > > > > > Are there known problems with Korean with Lucene? > > > > > > > > > > > > > > > > > > > > > > > > > > Thanks > > > > > > > > > > > > > > > > > > > > > > > > > > -John > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > > > > > > > > > Cheolgoo > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > --------------------------------------------------------------------- > > > > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > --------------------------------------------------------------------- > > > > > To unsubscribe, e-mail: [EMAIL PROTECTED] > > > > > For additional commands, e-mail: [EMAIL PROTECTED] > > > > > > > > > > > > > > > > > > > > > > > > > > > -- > > > > Cheolgoo > > > -- > Cheolgoo >