[ https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-4956:
--------------------------------

    Attachment: eval.patch

I did a very quick and dirty evaluation of various analyzers (short queries only) with the HANTEC-2 test collection (http://ir.kaist.ac.kr/anthology/2000.10-%EA%B9%80%EC%A7%80%EC%98%81.pdf).

I compared 4 different analyzers for index time, index size, and mean average precision (MAP) on the "L2" relevance set:
* StandardAnalyzer (whitespace on hangul / unigrams on hanja)
* CJKAnalyzer (bigram technique)
* KoreanAnalyzer
* MecabAnalyzer via JNI (https://github.com/bibreen/mecab-ko-lucene-analyzer)

For each one, I used 3 different ranking strategies with no parameter tuning of any sort: DefaultSimilarity, BM25Similarity, and DFR GL2 (a rough sketch of this kind of index/search setup is appended at the end of this message).

||Analyzer||Index Time||Index Size||MAP (TFIDF)||MAP (BM25)||MAP (GL2)||
|Standard|31s|128MB|.0959|.1018|.1028|
|CJK|30s|162MB|.1746|.1894|.1910|
|Korean|195s|125MB|.2055|.2096|.2058|
|Mecab|138s|147MB|.1877|.1960|.1928|

Note that on the first try I was unable to index the entire collection with KoreanAnalyzer, so I had to hack the filter to prevent this (an illustrative bounds-check sketch is also appended below):
{noformat}
Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
	at java.lang.String.substring(String.java:1907)
	at org.apache.lucene.analysis.ko.KoreanFilter.analysisChinese(KoreanFilter.java:405)
	at org.apache.lucene.analysis.ko.KoreanFilter.incrementToken(KoreanFilter.java:147)
	at org.apache.lucene.analysis.core.LowerCaseFilter.incrementToken(LowerCaseFilter.java:54)
	at org.apache.lucene.analysis.util.FilteringTokenFilter.incrementToken(FilteringTokenFilter.java:54)
	at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:174)
{noformat}

See the patch for more information (you can also download the data from http://www.kristalinfo.com/TestCollections/, set some constants, and run it yourself). Don't read too much into these numbers: the evaluation was really quick and dirty and might be biased in some way. For example, there are several charset issues in the test collection. But it does look like the analyzer here is effective.

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>            Assignee: Christian Moen
>              Labels: newbie
>         Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, lucene4956.patch, LUCENE-4956.patch
>
>
> The Korean language has specific characteristics. When developing a search service with Lucene & Solr in Korean, there are some problems in searching and indexing. The Korean analyzer solves these problems with a Korean morphological analyzer. It consists of a Korean morphological analyzer, dictionaries, a Korean tokenizer, and a Korean filter. The Korean analyzer is made for Lucene and Solr. If you develop a search service with Lucene in Korean, the Korean analyzer is the best choice.
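For anyone who wants to try something similar without digging through eval.patch, here is a minimal sketch of the kind of index/search loop behind the comparison above. It is not the code from the patch: the field name, index path, placeholder document/query text, and the choice of CJKAnalyzer are assumptions, and the HANTEC parsing plus MAP computation are omitted entirely. Index-time Similarity (which only affects norm encoding) is left at the default, which is good enough for a quick-and-dirty comparison.
{code:java}
import java.io.File;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cjk.CJKAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field.Store;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similarities.*;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class SimilarityComparison {
  public static void main(String[] args) throws Exception {
    // swap in the analyzer under test (Standard, CJK, Korean, Mecab, ...)
    Analyzer analyzer = new CJKAnalyzer(Version.LUCENE_42);
    Directory dir = FSDirectory.open(new File("/tmp/hantec-index"));   // placeholder path

    // index phase: in the real evaluation the HANTEC-2 documents would be parsed and added here
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_42, analyzer);
    IndexWriter writer = new IndexWriter(dir, iwc);
    Document doc = new Document();
    doc.add(new TextField("body", "문서 본문 텍스트", Store.NO));        // placeholder document text
    writer.addDocument(doc);
    writer.close();

    // search phase: run the same query against the same index under each ranking model
    Similarity[] models = {
      new DefaultSimilarity(),                     // TF-IDF baseline
      new BM25Similarity(),                        // BM25 with default k1/b
      new DFRSimilarity(new BasicModelG(),         // DFR GL2
                        new AfterEffectL(),
                        new NormalizationH2())
    };
    DirectoryReader reader = DirectoryReader.open(dir);
    for (Similarity sim : models) {
      IndexSearcher searcher = new IndexSearcher(reader);
      searcher.setSimilarity(sim);
      QueryParser qp = new QueryParser(Version.LUCENE_42, "body", analyzer);
      TopDocs hits = searcher.search(qp.parse("검색어"), 10);           // placeholder short query
      System.out.println(sim.getClass().getSimpleName() + ": " + hits.totalHits + " hits");
    }
    reader.close();
  }
}
{code}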
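On the StringIndexOutOfBoundsException above: the actual hack applied to KoreanFilter is not included in this comment, so the snippet below is only an illustration of the general shape of such a guard. The helper name and signature are made up and do not correspond to anything in the patch; the point is simply to bounds-check before calling String.substring() so one malformed term cannot abort indexing of the whole collection.
{code:java}
// Illustrative only -- NOT the actual fix in KoreanFilter.analysisChinese().
// Falls back to the original term instead of throwing when the requested
// substring range does not fit the term.
static String safeSubstring(String term, int begin, int end) {
  if (begin < 0 || end > term.length() || begin > end) {
    return term;   // skip decomposition for this term rather than failing the document
  }
  return term.substring(begin, end);
}
{code}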