[ 
https://issues.apache.org/jira/browse/LUCENE-6837?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14978156#comment-14978156
 ] 

Christian Moen commented on LUCENE-6837:
----------------------------------------

Thanks a lot for this, Konno-san.  Very nice work!  I like the idea to 
calculate the n-best cost using examples.

Since search mode and extended mode solve a similar problem, I'm 
wondering if it makes sense to introduce n-best as a separate mode of its own.  
From your experience developing the feature, do you think it makes much 
sense to use it with search and extended mode?

I think I'm in favour of supporting it for all the modes, even though it 
perhaps makes the most sense for normal mode.  The reason for this is to make 
sure that the entire API for {{JapaneseTokenizer}} is functional for all the 
tokenizer modes.

I'll add a few tests and I'd like to commit this soon.

> Add N-best output capability to JapaneseTokenizer
> -------------------------------------------------
>
>                 Key: LUCENE-6837
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6837
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 5.3
>            Reporter: KONNO, Hiroharu
>            Priority: Minor
>         Attachments: LUCENE-6837.patch
>
>
> Japanese morphological analyzers often generate mis-segmented tokens. N-best 
> output reduces the impact of mis-segmentation on search results. N-best output 
> is more meaningful than character N-grams, and it also increases the hit count.
> If you use N-best output, you can get decompounded tokens (e.g. 
> "シニアソフトウェアエンジニア" => {"シニア", "シニアソフトウェアエンジニア", "ソフトウェア", "エンジニア"}) and 
> overlapping tokens (e.g. "数学部長谷川" => {"数学", "部", "部長", "長谷川", "谷川"}), 
> depending on the dictionary and N-best parameter settings.
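
To make the overlapping-token example in the description concrete, here is a 
toy sketch of the n-best idea. It is not the code from the patch: the 
dictionary, the word costs, and the names segmentations, n_best_tokens and 
TOY_COSTS are all made up for illustration, and connection costs between 
adjacent words are ignored. It simply keeps the tokens from every 
segmentation whose total cost is within a margin of the cheapest one.

```python
def segmentations(text, costs):
    """Yield (total_cost, tokens) for every split of text into dictionary words."""
    if not text:
        yield 0, []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in costs:
            for rest_cost, rest in segmentations(text[end:], costs):
                yield costs[word] + rest_cost, [word] + rest


def n_best_tokens(text, costs, margin):
    """Collect tokens from all paths costing at most (best cost + margin)."""
    paths = list(segmentations(text, costs))
    if not paths:
        return set()
    best = min(cost for cost, _ in paths)
    return {tok for cost, toks in paths if cost <= best + margin for tok in toks}


# Hypothetical word costs, chosen only so that two segmentations compete.
TOY_COSTS = {"数学": 1, "部": 2, "部長": 1, "長谷川": 1, "谷川": 1}

one_best = n_best_tokens("数学部長谷川", TOY_COSTS, margin=0)
n_best = n_best_tokens("数学部長谷川", TOY_COSTS, margin=1)
```

With these toy costs, a margin of 0 keeps only the cheapest path 
(数学 / 部長 / 谷川), while a margin of 1 also admits 数学 / 部 / 長谷川, so 
the token set becomes the overlapping one from the description.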


