[ 
https://issues.apache.org/jira/browse/LUCENE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194623#comment-13194623
 ] 

Robert Muir edited comment on LUCENE-3726 at 1/27/12 12:24 PM:
---------------------------------------------------------------

{quote}
I'm thinking a possibility could be to expose possible decompounds as part of 
Kuromoji's Token interface.
{quote}

I like this idea: I think it would give the most flexibility. We would
populate some attribute from Token, just like we do today for other
attributes, and the actual indexing of compounds could then be controlled
with a configurable tokenfilter.

Long term, this lets the tokenizer stay a tokenizer and prevents it from 
growing too complex.
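
A rough sketch of that split, in plain Java rather than the real Lucene or
Kuromoji API. Everything named here (SketchToken, DecompoundFilterSketch, the
keepOriginal switch) is hypothetical and only stands in for "an attribute
populated by the tokenizer" and "a configurable tokenfilter". The output order
follows the ABCD -> AB, CD, ABCD(posInc=0) example from the issue description.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a token plus the proposed "decompounds" attribute.
final class SketchToken {
    final String term;
    final int posInc;               // position increment relative to the previous token
    final List<String> decompounds; // empty when the tokenizer proposes no split

    SketchToken(String term, int posInc, List<String> decompounds) {
        this.term = term;
        this.posInc = posInc;
        this.decompounds = decompounds;
    }

    @Override
    public String toString() {
        return term + "(posInc=" + posInc + ")";
    }
}

// Hypothetical filter: the tokenizer only reports candidates, the filter decides
// what actually gets indexed.
final class DecompoundFilterSketch {
    static List<SketchToken> filter(List<SketchToken> input, boolean keepOriginal) {
        List<SketchToken> out = new ArrayList<>();
        for (SketchToken t : input) {
            if (t.decompounds.isEmpty()) {
                out.add(t); // nothing to split, pass through unchanged
                continue;
            }
            int posInc = t.posInc; // first part takes over the compound's position
            for (String part : t.decompounds) {
                out.add(new SketchToken(part, posInc, Collections.<String>emptyList()));
                posInc = 1;
            }
            if (keepOriginal) {
                // Overlap the original compound on the last part's position.
                out.add(new SketchToken(t.term, 0, Collections.<String>emptyList()));
            }
        }
        return out;
    }
}
```

With keepOriginal set to false the same filter emits only the parts, so the
choice between the two behaviours stays out of the tokenizer.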
                
      was (Author: rcmuir):
    {quote}
I'm thinking a possibility could be to expose possible decompounds as part of 
Kuromoji's Token interface.
{quote}

I like this idea: I think it would give the most flexibility. We would
populate some attribute from Token, just like we do today, and the actual
decomposition could then be controlled with a configurable tokenfilter.

Long term, this lets the tokenizer stay a tokenizer and prevents it from 
growing too complex.
                  
> Default KuromojiAnalyzer to use search mode
> -------------------------------------------
>
>                 Key: LUCENE-3726
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3726
>             Project: Lucene - Java
>          Issue Type: Improvement
>    Affects Versions: 3.6, 4.0
>            Reporter: Robert Muir
>
> Kuromoji supports an option to segment text in a way more suitable for
> search, by preventing long compound nouns from becoming indexing terms.
> In general, 'how you segment' can be important depending on the application
> (see http://nlp.stanford.edu/pubs/acl-wmt08-cws.pdf for some studies on this
> in Chinese).
> The current algorithm penalizes the cost of long runs of kanji based on some
> parameters (SEARCH_MODE_PENALTY, SEARCH_MODE_LENGTH, etc).
> Some questions (these can be separate future issues if any useful ideas
> come out):
> * Should these parameters continue to be static-final, or configurable?
> * Should POS also play a role in the algorithm (can/should we refine
>   exactly what we decompound)?
> * Is the Tokenizer the best place to do this, or should we do it in a
>   tokenfilter? Or both?
>   With a tokenfilter, one idea would be to also preserve the original
>   indexing term, overlapping it: e.g. ABCD -> AB, CD, ABCD(posInc=0).
>   From my understanding this tends to help with noun compounds in other
>   languages, because IDF of the original term boosts 'exact' compound
>   matches.
>   But does a tokenfilter provide the segmenter enough 'context' to do
>   this properly?
> Either way, I think as a start we should turn on what we have by default:
> it's likely a very easy win.
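
To make the penalty idea in the description concrete, here is a minimal sketch
of a length-based penalty for all-kanji candidates. The constant names come
from the description; their values, the linear formula, and the class and
method names are assumptions for illustration and may not match what Kuromoji
actually does.

```java
// Hypothetical sketch, not Kuromoji's implementation.
final class SearchModePenaltySketch {
    static final int SEARCH_MODE_PENALTY = 10000; // placeholder value
    static final int SEARCH_MODE_LENGTH = 3;      // placeholder value

    // True when every code point in the surface form is in the Han script.
    static boolean isAllKanji(String surface) {
        if (surface.isEmpty()) {
            return false;
        }
        for (int i = 0; i < surface.length(); ) {
            int cp = surface.codePointAt(i);
            if (Character.UnicodeScript.of(cp) != Character.UnicodeScript.HAN) {
                return false;
            }
            i += Character.charCount(cp);
        }
        return true;
    }

    // Extra cost added to a candidate: grows with each kanji beyond the threshold,
    // so the segmenter prefers paths made of shorter pieces.
    static int penalty(String surface) {
        int length = surface.codePointCount(0, surface.length());
        if (isAllKanji(surface) && length > SEARCH_MODE_LENGTH) {
            return (length - SEARCH_MODE_LENGTH) * SEARCH_MODE_PENALTY;
        }
        return 0;
    }
}
```

Making the two constants instance fields set through a constructor would be
one way to address the "static-final, or configurable?" question above.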
