[ https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16852581#comment-16852581 ]
Robert Muir commented on LUCENE-8816:
-------------------------------------

Yeah, there are some optimizations specific to Japanese writing systems in the way we store the data: for example, writing katakana with a single byte, hiragana<->katakana transformations, and so on. The kana really must be optimized, due to the way it is used in the language and the dictionary; otherwise it wastes tons of space. But there are also some assertions about value ranges specific to IPADIC: not so great.

From my perspective, the problem with other dictionaries was always a licensing one. We want to at least be able to test any dictionary we want to support. This has changed, so I think it makes sense to look at how to really support other Japanese dictionaries compatible with the Apache license. In the worst case it might mean representing some data differently because we need more bits.

I don't have any clear idea how to test and package that. Maybe with the Gradle build it will become easier to build and test two jar files, to make it easiest for users to consume? It may not have the proper tools and APIs to support the custom-dictionary case (yet), but it would give users choices. Just trying to help us think about how to bite off small pieces at a time...

> Decouple Kuromoji's morphological analyser and its dictionary
> -------------------------------------------------------------
>
>            Key: LUCENE-8816
>            URL: https://issues.apache.org/jira/browse/LUCENE-8816
>        Project: Lucene - Core
>     Issue Type: Improvement
>     Components: modules/analysis
>       Reporter: Tomoko Uchida
>       Priority: Major
>
> I was inspired by this mailing-list thread:
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese users already know, the default built-in dictionary bundled with Kuromoji (MeCab IPADIC) is rather old and has not been maintained for many years.
> While it has slowly become obsolete, well-maintained and/or extended dictionaries have risen up in recent years (e.g. [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], [UniDic|https://unidic.ninjal.ac.jp/]). Several attempts/projects/efforts have been made in Japan to use them with Kuromoji.
> However, the current architecture - a dictionary bundled into the jar - is essentially incompatible with the idea of switching the system dictionary, and developers have difficulty doing so.
> Traditionally, the morphological analysis engine (the Viterbi logic) and the encoded dictionary (the language model) have been decoupled (as in MeCab, the origin of Kuromoji, or lucene-gosen). So decoupling them is a natural idea, and I feel it's a good time to rethink the current architecture.
> This would also be good for advanced users who have customized/re-trained their own system dictionary.
> Goals of this issue:
> * Decouple JapaneseTokenizer itself from the encoded system dictionary.
> * Implement a dynamic dictionary-load mechanism.
> * Provide a developer-oriented dictionary build tool.
> Non-goals:
> * Provide a learner or language model (that's up to users and outside the scope).
> I have not dived into the code yet, so I have no idea whether it's easy or difficult at this moment.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
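To make the kana optimizations Robert mentions concrete: in Unicode, the hiragana block (U+3041..U+3096) and the katakana block (U+30A1..U+30F6) are laid out in parallel, offset by exactly 0x60, so hiragana<->katakana transformation is a constant codepoint shift; and since the katakana block spans fewer than 256 codepoints, a katakana-only string can be stored one byte per character relative to the block base. The following is a minimal sketch of both ideas, not Kuromoji's actual storage format; the class and method names are hypothetical.

```java
public class KanaTransform {

    /** Convert hiragana characters to katakana; everything else passes through. */
    static String hiraganaToKatakana(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            // Hiragana sits exactly 0x60 codepoints below katakana.
            sb.append(c >= '\u3041' && c <= '\u3096' ? (char) (c + 0x60) : c);
        }
        return sb.toString();
    }

    /** Encode a katakana-only string with one byte per character. */
    static byte[] encodeKatakana(String s) {
        byte[] out = new byte[s.length()];
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c < '\u30A1' || c > '\u30F6') {
                throw new IllegalArgumentException("not katakana: " + c);
            }
            out[i] = (byte) (c - '\u30A1'); // offsets fit in 0..0x55
        }
        return out;
    }

    /** Inverse of encodeKatakana. */
    static String decodeKatakana(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length);
        for (byte b : bytes) {
            sb.append((char) ('\u30A1' + (b & 0xFF)));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String kata = hiraganaToKatakana("すし");            // "スシ"
        System.out.println(kata);
        System.out.println(decodeKatakana(encodeKatakana(kata))); // round-trips
    }
}
```

Any real decoupled-dictionary format would also have to decide whether to keep such kana-specific tricks, which is part of the "more bits" trade-off discussed above.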