[
https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13240610#comment-13240610
]
Christian Moen commented on LUCENE-3916:
----------------------------------------
Thanks a lot, Robert.
I've added a comment about this in {{schema.xml}} as part of SOLR-3276.
I'm resolving this issue.
> Consider different query and index segmentation for Japanese
> ------------------------------------------------------------
>
> Key: LUCENE-3916
> URL: https://issues.apache.org/jira/browse/LUCENE-3916
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Affects Versions: 3.6, 4.0
> Reporter: Christian Moen
> Priority: Minor
>
> Kuromoji today uses search mode segmentation both at query and index time.
> The benefit of search mode segmentation is that it segments compounds such
> as 関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際
> (international), 空港 (airport), and leaves the compound 関西国際空港 as a synonym to
> 関西.
> This segmentation allows us to get a match for 空港 (airport), which is good
> for recall, and we'd get good precision when searching for the compound
> 関西国際空港 because of IDF.
> However, if we search for the compound 関西国際空港 (Kansai International Airport)
> our query becomes (by default) an OR-query with terms 関西 (Kansai), 関西国際空港
> (Kansai International Airport), 国際 (international) and 空港 (airport).
> This behaviour is by design when using OR as the default operator, but this
> also has the effect of returning generic hits like 空港 (airport) when the user
> searches for something very specific like 関西国際空港 (Kansai International
> Airport) -- and these hits are also highlighted.
> This doesn't necessarily mean that ranking is flawed per se, but a user or
> application might prefer precision over recall. In order to favour
> precision, we can consider using normal mode segmentation for queries, but
> retain search mode segmentation on the indexing side.
> Does anyone have any general opinion on this? What would we do for other
> languages in the case of compound splitting?
> Perhaps this can be dealt with as a documentation issue with a comment in
> {{schema.xml}} while keeping the current behaviour?
> Many thanks for any input.
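
To make the trade-off described above concrete, here is a rough toy model in Python. It does not call Kuromoji or Lucene: the `segment` function is a hypothetical stand-in with the example segmentations hard-coded, the documents are given as pre-split words, and it assumes that normal mode keeps 関西国際空港 as a single token. It only illustrates how an OR-query behaves when the query side uses search mode versus normal mode against a search-mode index.

```python
# Toy model only: hard-coded segmentations stand in for the real tokenizer.
COMPOUND = "関西国際空港"  # Kansai International Airport


def segment(text, mode):
    """Hypothetical stand-in for the tokenizer; only knows the example compound."""
    if text == COMPOUND:
        if mode == "search":
            # Search mode: the parts, plus the compound kept as a synonym.
            return ["関西", COMPOUND, "国際", "空港"]
        # Assumption: normal mode leaves the compound as one token.
        return [COMPOUND]
    return [text]


def build_index(docs, mode):
    """Build a term -> set of doc ids map using the given segmentation mode."""
    index = {}
    for doc_id, words in docs.items():
        for word in words:
            for term in segment(word, mode):
                index.setdefault(term, set()).add(doc_id)
    return index


def or_query(index, query, mode):
    """Return ids of documents matching ANY term of the segmented query."""
    hits = set()
    for term in segment(query, mode):
        hits |= index.get(term, set())
    return hits


# Two illustrative documents, given as pre-split words.
docs = {
    "kix": [COMPOUND],
    "narita": ["成田", "空港"],  # Narita, airport
}

index = build_index(docs, "search")  # search mode at index time

both_search = or_query(index, COMPOUND, "search")    # search mode at query time
normal_query = or_query(index, COMPOUND, "normal")   # normal mode at query time
generic_query = or_query(index, "空港", "normal")     # generic term still recalls both

print(sorted(both_search))    # ['kix', 'narita']
print(sorted(normal_query))   # ['kix']
print(sorted(generic_query))  # ['kix', 'narita']
```

In this sketch the search-mode query also returns the generic 空港 document, while the normal-mode query returns only the document containing the compound. A search for 空港 still matches both documents, because the index side keeps the search-mode segmentation.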