[ https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13197730#comment-13197730 ]
Christian Moen commented on LUCENE-3726:
----------------------------------------
I've segmented some Japanese Wikipedia text into sentences (using a naive
sentence segmenter) and then segmented each sentence using both normal and
search mode with the Kuromoji on GitHub that has LUCENE-3730 applied.
Segmentation with Kuromoji in Lucene should be similar overall (modulo some
differences in punctuation handling).
Search mode and normal mode segmentation match completely in 90.7% of the
sentences segmented, and there's a 99.6% match at the token level (when
counting normal-mode tokens).
Attached are some HTML files with a total of 10,000 sentences that
demonstrate the differences in segmentation.
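For anyone who wants to reproduce the comparison, the harness boils down to
something like the following. This is a sketch assuming the builder API of
the GitHub Kuromoji; the sample sentences are illustrative, not taken from
the attached files.

{code:java}
import java.util.ArrayList;
import java.util.List;
import org.atilika.kuromoji.Token;
import org.atilika.kuromoji.Tokenizer;

public class ModeComparison {
  public static void main(String[] args) {
    Tokenizer normal = Tokenizer.builder().build();  // normal mode is the default
    Tokenizer search = Tokenizer.builder().mode(Tokenizer.Mode.SEARCH).build();

    String[] sentences = { "関西国際空港に行った。", "日本経済新聞を読む。" };
    int identical = 0;
    for (String sentence : sentences) {
      List<String> a = surfaces(normal.tokenize(sentence));
      List<String> b = surfaces(search.tokenize(sentence));
      if (a.equals(b)) {
        identical++;
      } else {
        // e.g. search mode splits 関西国際空港 into 関西 国際 空港
        System.out.println(sentence + "\n  normal: " + a + "\n  search: " + b);
      }
    }
    System.out.println(identical + "/" + sentences.length
        + " sentences segmented identically");
  }

  private static List<String> surfaces(List<Token> tokens) {
    List<String> result = new ArrayList<String>();
    for (Token token : tokens) {
      result.add(token.getSurfaceForm());
    }
    return result;
  }
}
{code}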
Overall, I think search mode does a decent job. I've written to someone else
doing Japanese NLP to get a second opinion, in particular on whether the
kanji splitting should be made somewhat less eager to split three-letter
words.
> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
> Key: LUCENE-3726
> URL: https://issues.apache.org/jira/browse/LUCENE-3726
> Project: Lucene - Java
> Issue Type: Improvement
> Affects Versions: 3.6, 4.0
> Reporter: Robert Muir
> Attachments: kuromojieval.tar.gz
>
>
> Kuromoji supports an option to segment text in a way more suitable for search,
> by preventing long compound nouns from becoming indexing terms.
> In general, 'how you segment' can be important depending on the application
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this
> in Chinese).
> The current algorithm adds a penalty to the cost of long runs of kanji,
> controlled by some parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH,
> etc.).
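>
> As an illustration, the penalty has roughly this shape. This is a sketch
> only: isKanji() stands in for the real character-class check, and the
> constant names are taken from the parameters above, not from the exact
> source.
> {code:java}
> // During lattice construction, candidate tokens that are long, pure-kanji
> // runs get extra cost, so the cheapest Viterbi path prefers splitting them
> // into shorter terms.
> int computePenalty(char[] text, int offset, int length) {
>   if (length <= SEARCH_MODE_LENGTH) {
>     return 0;                     // short tokens are left alone
>   }
>   for (int i = offset; i < offset + length; i++) {
>     if (!isKanji(text[i])) {
>       return 0;                   // only pure kanji runs are penalized
>     }
>   }
>   // every character beyond the length threshold raises the token's cost
>   return (length - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
> }
> {code}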
> Some questions (these can be separate future issues if any useful ideas come
> out):
> * should these parameters continue to be static-final, or configurable?
> * should POS also play a role in the algorithm (can/should we refine exactly
> what we decompound)?
> * is the Tokenizer the best place to do this, or should we do it in a
> tokenfilter? or both?
> with a tokenfilter, one idea would be to also preserve the original
> indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0); see the
> sketch after this list.
> From my understanding this tends to help with noun compounds in other
> languages, because the IDF of the original term boosts 'exact' compound
> matches. But does a tokenfilter provide the segmenter enough 'context' to
> do this properly?
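>
> Below is a minimal sketch of that overlapping tokenfilter idea. It is
> hypothetical code, not an existing Lucene filter: the Decompounder hook is
> made up for illustration, and offsets are left pointing at the whole
> compound.
> {code:java}
> import java.io.IOException;
> import java.util.LinkedList;
> import org.apache.lucene.analysis.TokenFilter;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
> import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
>
> public final class OverlappingDecompoundFilter extends TokenFilter {
>   /** Hypothetical hook: returns the parts of a compound, or null. */
>   public interface Decompounder {
>     String[] split(CharSequence compound);
>   }
>
>   private final Decompounder decompounder;
>   private final LinkedList<String> pendingParts = new LinkedList<String>();
>   private String original;  // compound surface form, emitted with posInc=0
>   private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>   private final PositionIncrementAttribute posIncAtt =
>       addAttribute(PositionIncrementAttribute.class);
>
>   public OverlappingDecompoundFilter(TokenStream input, Decompounder decompounder) {
>     super(input);
>     this.decompounder = decompounder;
>   }
>
>   @Override
>   public boolean incrementToken() throws IOException {
>     if (!pendingParts.isEmpty()) {
>       termAtt.setEmpty().append(pendingParts.removeFirst());
>       posIncAtt.setPositionIncrement(1);   // each part advances a position
>       return true;
>     }
>     if (original != null) {
>       termAtt.setEmpty().append(original);
>       posIncAtt.setPositionIncrement(0);   // ABCD overlaps the last part
>       original = null;
>       return true;
>     }
>     if (!input.incrementToken()) {
>       return false;
>     }
>     String[] parts = decompounder.split(termAtt);
>     if (parts != null && parts.length > 1) {
>       original = termAtt.toString();        // remember ABCD
>       termAtt.setEmpty().append(parts[0]);  // emit AB in ABCD's position
>       for (int i = 1; i < parts.length; i++) {
>         pendingParts.add(parts[i]);         // CD follows, then ABCD(posInc=0)
>       }
>     }
>     return true;
>   }
>
>   @Override
>   public void reset() throws IOException {
>     super.reset();
>     pendingParts.clear();
>     original = null;
>   }
> }
> {code}
> The net effect is that an exact query for the compound still matches the
> overlapped original term, while the parts make partial matches recallable.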
> Either way, I think as a start we should turn on what we have by default: it's
> likely a very easy win.