[jira] [Resolved] (LUCENE-3916) Consider different query and index segmentation for Japanese

Christian Moen (Resolved) (JIRA) Wed, 28 Mar 2012 11:33:53 -0700

     [ 
https://issues.apache.org/jira/browse/LUCENE-3916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Christian Moen resolved LUCENE-3916.
------------------------------------

    Resolution: Fixed
    
> Consider different query and index segmentation for Japanese
> ------------------------------------------------------------
>
>                 Key: LUCENE-3916
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3916
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>    Affects Versions: 3.6, 4.0
>            Reporter: Christian Moen
>            Priority: Minor
>
> Kuromoji currently uses search mode segmentation at both query and index time.
> The benefit of search mode segmentation is that it splits compounds such
> as 関西国際空港 (Kansai International Airport) into 関西 (Kansai), 国際
> (international) and 空港 (airport), and keeps the compound 関西国際空港 as a
> synonym of 関西.
> This segmentation lets us match on 空港 (airport), which is good for recall,
> and we still get good precision when searching for the compound 関西国際空港
> because of IDF.
> However, if we search for the compound 関西国際空港 (Kansai International Airport) 
> our query becomes (by default) an OR-query with terms 関西 (Kansai), 関西国際空港 
> (Kansai International Airport), 国際 (international) and 空港 (airport).
> This behaviour is by design when OR is the default operator, but it also
> returns generic hits such as 空港 (airport) when the user searches for
> something very specific like 関西国際空港 (Kansai International Airport),
> and these hits are highlighted as well.
> This doesn't necessarily mean that ranking is flawed per se, but a user or 
> application might prefer precision over recall.  In order to favour 
> precision, we can consider using normal mode segmentation for queries, but 
> retain search mode segmentation on the indexing side.
> Does anyone have a general opinion on this?  What would we do for other
> languages in the case of compound splitting?
> Perhaps this can be dealt with as a documentation issue with a comment in 
> {{schema.xml}} while keeping the current behaviour?
> Many thanks for any input.
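
The trade-off described in the issue can be illustrated with a small toy model. The sketch below does not use Lucene or Kuromoji at all: `SegmentationDemo`, `segment`, `buildIndex` and `orSearch` are hypothetical names, and the segmenter is hard-coded to return the tokens quoted in the issue for 関西国際空港. It only shows why an OR-query built from search mode tokens also matches a document that contains just 空港, while a query segmented in normal mode does not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class SegmentationDemo {

    // Hypothetical stand-in for a segmenter. It knows a single compound and
    // returns the tokens listed in the issue text; it is NOT Kuromoji.
    static List<String> segment(String text, boolean searchMode) {
        if (text.equals("関西国際空港") && searchMode) {
            // Search mode: the parts plus the compound kept as a synonym.
            return Arrays.asList("関西", "関西国際空港", "国際", "空港");
        }
        // Normal mode (and any other input): a single token.
        return Arrays.asList(text);
    }

    // Index side: always segment documents in search mode.
    static List<Set<String>> buildIndex(List<String> docs) {
        List<Set<String>> index = new ArrayList<>();
        for (String doc : docs) {
            index.add(new HashSet<>(segment(doc, true)));
        }
        return index;
    }

    // OR-query: a document is a hit if it contains at least one query term.
    static List<Integer> orSearch(List<Set<String>> index, List<String> terms) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < index.size(); i++) {
            for (String term : terms) {
                if (index.get(i).contains(term)) {
                    hits.add(i);
                    break;
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Document 0 is the specific airport, document 1 is a generic "airport".
        List<Set<String>> index = buildIndex(Arrays.asList("関西国際空港", "空港"));

        // Search mode at query time: the generic document is also returned.
        System.out.println(orSearch(index, segment("関西国際空港", true)));  // [0, 1]

        // Normal mode at query time: only the specific document is returned.
        System.out.println(orSearch(index, segment("関西国際空港", false))); // [0]

        // A query for the part still finds both, because the index was
        // built with search mode segmentation.
        System.out.println(orSearch(index, segment("空港", false)));         // [0, 1]
    }
}
```

In this toy setup, recall for the part 空港 is preserved by the index-side segmentation, and the query-side change only affects what a search for the full compound returns. Scoring and highlighting are left out entirely.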
