[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-4956:
--------------------------------

    Attachment: eval.patch

I did a very quick and dirty evaluation of various analyzers (short queries only) with the HANTEC-2 test collection (http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf).

I compared 4 different analyzers for index time, index size, and mean average precision (MAP) on the "L2" relevance set:
* StandardAnalyzer (whitespace on hangul / unigrams on hanja)
* CJKAnalyzer (bigram technique)
* KoreanAnalyzer
* MecabAnalyzer via JNI (https://github.com/bibreen/mecab-ko-lucene-analyzer)

For each one, I used 3 different ranking strategies with no parameter tuning of any sort: DefaultSimilarity, BM25Similarity, and DFR GL2 (a rough sketch of this kind of index/search setup is appended at the end of this message).

||Analyzer||Index Time||Index Size||MAP (TFIDF)||MAP (BM25)||MAP (GL2)||
|Standard|31s|128MB|.0959|.1018|.1028|
|CJK|30s|162MB|.1746|.1894|.1910|
|Korean|195s|125MB|.2055|.2096|.2058|
|Mecab|138s|147MB|.1877|.1960|.1928|

Note that on the first try I was unable to index the entire collection with KoreanAnalyzer, so I had to hack the filter to prevent this (an illustrative bounds-check sketch is also appended below):
{noformat}
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
	at java.lang.String.substring(String.java:1907)
	at org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405)
	at org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
{noformat}

See the patch for more information (you can also download the data from http://www.kristalinfo.com/TestCollections/, set some constants, and run it yourself). Don't read too much into these numbers: the evaluation was really quick and dirty and might be biased in some way. For example, there are several charset issues in the test collection. But it does look like the analyzer here is effective.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>            Assignee: Christian Moen
>              Labels: newbie
>         Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch
>
>
> The Korean language has specific characteristics. When developing a search service with Lucene & Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer, and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, the Korean analyzer is the best choice.
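For anyone who wants to try something similar without digging through eval.patch, here is a minimal sketch of the kind of index/search loop behind the comparison above. It is not the code from the patch: the field name, index path, placeholder document/query text, and the choice of CJKAnalyzer are assumptions, and the HANTEC parsing plus MAP computation are omitted entirely. Index-time Similarity (which only affects norm encoding) is left at the default, which is good enough for a quick-and-dirty comparison.
{code:java}
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimilarityComparison {
  public static void main(String[] args) throws Exception {
    // swap in the analyzer under test (Standard, CJK, Korean, Mecab, ...)
    Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_42);
    Directory dir = FSDirectory.open(new File("/tmp/hantec-index"));   // placeholder path

    // index phase: in the real evaluation the HANTEC-2 documents would be parsed and added here
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
    Document doc = new Document();
    doc.add(new TextField("body", "문서 본문 텍스트", Store.NO));        // placeholder document text
    writer.addDocument(doc);
    writer.close();

    // search phase: run the same query against the same index under each ranking model
    Similarity[] models = {
      new DefaultSimilarity(),                     // TF-IDF baseline
      new BM25Similarity(),                        // BM25 with default k1/b
      new DFRSimilarity(new BasicModelG(),         // DFR GL2
                        new AfterEffectL(),
                        new NormalizationH2())
    };
    DirectoryReader reader = DirectoryReader.open(dir);
    for (Similarity sim : models) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setSimilarity(sim);
      QueryParser qp = new QueryParser(Version.LUCENE_42, "body", analyzer);
      TopDocs hits = searcher.search(qp.parse("검색어"), 10);           // placeholder short query
      System.out.println(sim.getClass().getSimpleName() + ": " + hits.totalHits + " hits");
    }
    reader.close();
  }
}
{code}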
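On the StringIndexOutOfBoundsException above: the actual hack applied to KoreanFilter is not included in this comment, so the snippet below is only an illustration of the general shape of such a guard. The helper name and signature are made up and do not correspond to anything in the patch; the point is simply to bounds-check before calling String.substring() so one malformed term cannot abort indexing of the whole collection.
{code:java}
// Illustrative only -- NOT the actual fix in KoreanFilter.analysisChinese().
// Falls back to the original term instead of throwing when the requested
// substring range does not fit the term.
static String safeSubstring(String term, int begin, int end) {
  if (begin < 0 || end > term.length() || begin > end) {
    return term;   // skip decomposition for this term rather than failing the document
  }
  return term.substring(begin, end);
}
{code}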