[ 
https://issues.apache.org/jira/browse/LUCENE-3767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13214217#comment-13214217
 ] 

Christian Moen commented on LUCENE-3767:
----------------------------------------

Robert, some comments are below.

{quote}
And I tend to like Mike's improvements from a relevance perspective for these 
reasons:

# keeping the original compound term for improved precision
# preventing compound decomposition from having any unrelated negative impact 
on the rest of the tokenization
{quote}

Very good points.

{quote}
So I think we should pursue this change, even if we want to separately train a 
dictionary in the future, because in that case, we would just disable the kanji 
decomposition heuristic but keep the heuristic (obviously re-tuned!) for 
katakana?
{quote}

I agree completely.

{quote}
The dictionary documentation for the original ipadic has the ability to hold 
compound data (not in mecab-ipadic though, so maybe it was never 
implemented?!), but I don't actually see it in any implementations. So yeah, we 
would need to find a corpus containing compound information (and of course 
extend the file format and add support to kuromoji) to support that.

However, would this really solve the total issue? Wouldn't that really only 
help for known kanji compounds... whereas most katakana compounds (e.g. the 
software engineer example) are expected to be OOV anyway? So it seems like, 
even if we ensured the dictionary was annotated for long kanji such that we 
always used decompounded forms, we need a 'heuristical' decomposition like 
search-mode either way, at least for the unknown katakana case?
{quote}

I've made an inquiry to a friend who did his PhD work at Prof. Matsumoto's lab 
at NAIST (where ChaSen was made) regarding compound information and the Kyoto 
Corpus.

You are perfectly right that this doesn't solve the complete problem, as unknown 
words can themselves be compounds -- unknown compounds.  The approach used today 
is basically to add all the potential decompounds the model knows about to the 
lattice and see whether a short path can be found, often in combination with an 
unknown word.

We get errors such as クイーンズコミックス (Queen's Comics) becoming クイーン ズコミックス (Queen 
Zukomikkusu) because クイーン (Queen) is known.
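
To make this failure mode concrete, here is a minimal sketch of a cost-based 
lattice search. It is not Kuromoji's code: the word costs, the length threshold 
and the long-token penalty are made-up values, connection costs are ignored, and 
token_cost and segment are hypothetical names. It only illustrates how a known 
prefix plus an unknown remainder can win once long tokens are penalized.

```python
# Toy lattice search. All numbers below are assumptions for illustration,
# not Kuromoji's real dictionary costs; connection costs are ignored.
KNOWN_WORD_COSTS = {"クイーン": 3000}   # the only word the toy "model" knows
UNKNOWN_BASE_COST = 4000               # fixed cost of any unknown token
UNKNOWN_PER_CHAR_COST = 500            # extra cost per character of it
LONG_TOKEN_LENGTH = 7                  # tokens longer than this are "long"
LONG_TOKEN_PENALTY = 3000              # search-mode penalty for long tokens


def token_cost(token, search_mode):
    """Cost of one lattice edge (one candidate token)."""
    if token in KNOWN_WORD_COSTS:
        cost = KNOWN_WORD_COSTS[token]
    else:
        cost = UNKNOWN_BASE_COST + UNKNOWN_PER_CHAR_COST * len(token)
    if search_mode and len(token) > LONG_TOKEN_LENGTH:
        cost += LONG_TOKEN_PENALTY
    return cost


def segment(text, search_mode):
    """Return (cost, tokens) of the cheapest path through the lattice."""
    # best[i] holds the cheapest (cost, tokens) covering text[:i].
    best = [(0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            prev_cost, prev_tokens = best[start]
            token = text[start:end]
            cost = prev_cost + token_cost(token, search_mode)
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, prev_tokens + [token])
    return best[len(text)]


print(segment("クイーンズコミックス", search_mode=False)[1])
# ['クイーンズコミックス']
print(segment("クイーンズコミックス", search_mode=True)[1])
# ['クイーン', 'ズコミックス']
```

With these toy numbers the whole word is the cheapest path when nothing is 
penalized, but as soon as the long token is penalized the path through the 
known word クイーン becomes cheaper and the remainder is left as one unknown 
token.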

I'll open up a separate JIRA for discussing search-mode improvements. :)

                
> Explore streaming Viterbi search in Kuromoji
> --------------------------------------------
>
>                 Key: LUCENE-3767
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3767
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 3.6, 4.0
>
>         Attachments: LUCENE-3767.patch, LUCENE-3767.patch, LUCENE-3767.patch, 
> compound_diffs.txt
>
>
> I've been playing with the idea of changing the Kuromoji viterbi
> search to be 2 passes (intersect, backtrace) instead of 4 passes
> (break into sentences, intersect, score, backtrace)... this is very
> much a work in progress, so I'm just getting my current state up.
> It's got tons of nocommits, doesn't properly handle the user dict nor
> extended modes yet, etc.
> One thing I'm playing with is to add a double backtrace for the long
> compound tokens, ie, instead of penalizing these tokens so that
> shorter tokens are picked, leave the scores unchanged but on backtrace
> take that penalty and use it as a threshold for a 2nd best
> segmentation...
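
One possible reading of the "penalty as threshold" idea in the description 
above, as a rough sketch only. It is not the attached patch: the dictionary, 
the costs, the example word 関西国際空港 and the names best_path and 
segment_with_second_best are all assumptions, and the alternative is found by 
re-segmenting the long token on its own rather than by a second backtrace over 
the full lattice.

```python
# Hypothetical toy dictionary; costs are made up, not Kuromoji's.
WORD_COSTS = {"関西国際空港": 5000, "関西": 2500, "国際": 2500, "空港": 2500}


def best_path(text, max_token_len=None):
    """Cheapest (cost, tokens) using dictionary words only, or None."""
    best = [(0, [])] + [None] * len(text)
    for end in range(1, len(text) + 1):
        for start in range(end):
            token = text[start:end]
            if best[start] is None or token not in WORD_COSTS:
                continue
            if max_token_len is not None and len(token) > max_token_len:
                continue
            cost = best[start][0] + WORD_COSTS[token]
            if best[end] is None or cost < best[end][0]:
                best[end] = (cost, best[start][1] + [token])
    return best[len(text)]


def segment_with_second_best(text, long_len=4, penalty=3000):
    """Keep unpenalized best tokens; add a decompounded form if close enough."""
    result = best_path(text)          # scores are left unchanged here
    if result is None:
        raise ValueError("text cannot be segmented with the toy dictionary")
    output = []
    for token in result[1]:
        output.append(token)          # the original compound is always kept
        if len(token) > long_len:
            # Best segmentation of the same span that avoids the long token.
            alt = best_path(token, max_token_len=len(token) - 1)
            # The former penalty now acts as a threshold on the cost gap.
            if alt is not None and alt[0] - WORD_COSTS[token] <= penalty:
                output.extend(alt[1])
    return output


print(segment_with_second_best("関西国際空港"))
# ['関西国際空港', '関西', '国際', '空港']
print(segment_with_second_best("関西国際空港", penalty=2000))
# ['関西国際空港']
```

In this sketch the compound is never lost: the shorter tokens are only added 
when the second-best segmentation is within the threshold of the best one.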



