[ 
https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13212623#comment-13212623
 ] 

Christian Moen commented on LUCENE-3767:
----------------------------------------

Mike,

Thanks a lot for this.  I'd meant to comment on this earlier and I'd like to 
look further into the details, but I really like your idea of running the 
Viterbi in a streaming fashion.

Kuromoji originally split input using two punctuation characters as this would 
be an articulation point in the lattice/graph in practice, but your idea is 
much more elegant and also faithful to the statistical model.

As for dealing with compounds, the penalization is a crude hack as you know, 
but it turns out to work quite well in practice as many of the "decompounds" 
are known to the statistical model.  However, in cases where not all of them 
are known, we sometimes get wrong decompounds.  I've done some analysis of 
these cases and it's possible to add more heuristics to deal with some that 
are obviously wrong, such as a word starting with a long vowel sound in 
katakana.  This is a slippery slope that I'm reluctant to pursue...
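
To make the two ideas in this paragraph concrete, here is a minimal sketch, 
not Kuromoji's actual code.  The class name, method names and parameters are 
all hypothetical, and I'm assuming that "long vowel sound" refers to the 
katakana prolonged sound mark (U+30FC).  The first method shows a length-based 
penalty of the kind discussed, the second a check that would reject a 
decompound segment starting with that mark.

```java
// Sketch only: names and parameters are hypothetical, not Kuromoji's API.
final class DecompoundHeuristics {
    // KATAKANA-HIRAGANA PROLONGED SOUND MARK; it lengthens the preceding vowel.
    static final char PROLONGED_SOUND_MARK = '\u30FC';

    // Extra cost added to a long candidate token so that a path made of
    // shorter tokens can win the search. Tokens at or below the threshold
    // are not penalized.
    static int lengthPenalty(int tokenLength, int lengthThreshold, int penaltyPerChar) {
        if (tokenLength <= lengthThreshold) {
            return 0;
        }
        return (tokenLength - lengthThreshold) * penaltyPerChar;
    }

    // A segment that starts with the prolonged sound mark has no preceding
    // vowel to lengthen, so it is treated as an implausible word start.
    static boolean isImplausibleSegment(String segment) {
        return !segment.isEmpty() && segment.charAt(0) == PROLONGED_SOUND_MARK;
    }
}
```

A check like this only catches one class of bad splits, which is exactly why 
piling up such rules feels like a slippery slope.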

Robert mentioned earlier that he believes IPADIC could have been annotated with 
compounds as the documentation mentions them, but they're not part of the 
IPADIC model we are using.  If it is possible to get the decompounds from the 
training data (Kyoto Corpus), a better overall approach is then to do regular 
segmentation (normal mode) and then provide the decompounds directly from the 
token info for the compounds.  We might need to retrain the model and 
preserve the decompounds in order for this to work, but I think it is worth 
investigating.
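
As a rough illustration of what "decompounds from the token info" could look 
like, here is a small sketch.  The CompoundEntry class and its fields are 
hypothetical and do not reflect the existing token info format.  It also 
assumes that search mode replaces a compound with its stored parts, which is 
only one possible design.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Sketch only: a hypothetical dictionary entry, not Kuromoji's token info format.
final class CompoundEntry {
    final String surface;
    final List<String> parts; // empty when the entry is not a known compound

    CompoundEntry(String surface, String... parts) {
        this.surface = surface;
        this.parts = Collections.unmodifiableList(Arrays.asList(parts));
    }

    // Normal mode always emits the surface form. Search mode emits the
    // stored parts when there are any, and falls back to the surface form.
    List<String> tokens(boolean searchMode) {
        if (searchMode && !parts.isEmpty()) {
            return parts;
        }
        return Collections.singletonList(surface);
    }
}
```

With an entry whose surface form is a compound and whose parts are its 
annotated segments, normal mode would return the single compound and search 
mode the list of parts, without any penalty being involved.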
                
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
>                 Key: LUCENE-3767
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3767
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3767.patch, LUCENE-3767.patch, LUCENE-3767.patch, 
> compound_diffs.txt
>
>
> I've been playing with the idea of changing the Kuromoji viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict nor
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...
