[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-19 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867977#comment-16867977
 ] 

Mike Sokolov commented on LUCENE-8816:
--

LUCENE-8871 opened to cover moving dictionary builder tools into main kuromoji 
source tree, mostly so it gets tested properly.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-19 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16867793#comment-16867793
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

I opened LUCENE-8869 to just decouple the system dictionary resource to a 
separated jar, and use it as default (as is).

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-15 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864672#comment-16864672
 ] 

Mike Sokolov commented on LUCENE-8816:
--

I opened LUCENE-8863 to cover some small, but blocking, issues I uncovered 
while loading a custom dictionary.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861672#comment-16861672
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

I'd like to add some more information: the leftID (and rightID) is tied to the 
POS tags and in practice there are not so much pos tag variations. I think the 
current constraint {{leftId < 4096}} (or {{leftId < 8191}}, if it can be easily 
changed so) is perfectly okay if following conditions are met.

1. The dictionary learner/re-trainer program included in the mecab-ipadic 
devtool does not generate leftID (and rightID) values larger than 4196 (or 
8191).
2. UniDic (I'd like to support this dictionary on this issue as I wrote in the 
issue description) has no leftID (and rightID) values greater than 4196 (or 
8191).
3. A few well-known variants of mecab-ipadic or unidic does not have leftID 
(and rightID) values larger than 4196 (or 8191).

Give me some time to examine if we need to re-consider the constraint. (It's 
just a guess but the original mecab itself is also a performance-savvy 
software, so it could have similar restrictions for its dictionary format.) At 
least about the point 3, I think I can talk with the dictionary developers 
about it before tackling with Lucene code, if it's needed.

There is another possibility that users give large values to leftIDs (and 
rightIDs) in their customized dictionary by hand, however I don't think we 
should take care about that. I have no idea about Korian dictionaries.

I agree with that it will be better to change the all assertions to some 
Exceptions so that users can figure out the problem with their customized 
dictionary.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861622#comment-16861622
 ] 

Mike Sokolov commented on LUCENE-8816:
--

Thanks Robert, yeah I understand this was built for a single dictionary, not a 
general-purpose tool, and hardening is required to enable wider usage.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861618#comment-16861618
 ] 

Robert Muir commented on LUCENE-8816:
-

Mike: yes, I agree with you. The use of assert was laziness on my part: we 
should treat it as technical debt and fix it. see my earlier comments on this 
JIRA issue for elaboration.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-11 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16861609#comment-16861609
 ] 

Mike Sokolov commented on LUCENE-8816:
--

I see that in {{BinaryDictionaryWriter}} we restrict incoming leftID (and 
rightID) to be < 4096 because we are going to pack into a 16-bit short with 3 
flag bits. However it seems we have room for one more bit (since 2^(16-3) == 
8192). Am I missing something? Do we use that other bit somewhere? I see eg 
that in {{BinaryDictionary}} when we decode, we >>> 3 to get back the ids, so I 
think it should be OK to allow ids up to 8191. [~rcmuir] do you know why it is 
currenly limited to 4096? Also I think it would make sense to change the 
asserts there to be IllegalArgumentException so they are raised whenever the 
tool is run, since we would get garbage if this limit is exceeded, and (I 
think) nothing else will catch it.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853784#comment-16853784
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks! [~tomoko] :D

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853783#comment-16853783
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Hi [~danmuzi],

thanks for opening the issue. I hope both changes will be safely merged at some 
point in time.
{quote}I'll notify you if there is some progress.
{quote}
I will keep watching your updates in there, though I'd like to withhold 
approval or disapproval to it.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853754#comment-16853754
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks for your reply! [~tomoko]


 I created the LUCENE-8817 issue and set up this issue with the "related to" 
link.
 I'll notify you if there is some progress.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853720#comment-16853720
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

OK, if you will be able to merge two dictionary builders without any affects on 
kuromoji, just open the issue (don't forget to add a link to here) and work for 
it. Once your branch/patch is successfully approved to push to the upstream 
Lucene repo (ASF gitbox), I will merge that from upstream and continue my work.

Please keep in mind: I won't merge any branch/patch to my local branch which 
has not yet merged to upstream master  (in other words, I will merge or 
cherry-pick only from upstream master). And vise versa, never merge my WIP 
branch/patch to your branch/patch.

I cannot take time to review your branch/patch (anyway I am still new to 
kuromoji or nori), hope someone with commit privilege take care [~danmuzi]'s 
patch.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-06-01 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853678#comment-16853678
 ] 

Namgyu Kim commented on LUCENE-8816:


Oh, you're right. [~tomoko] :D
 I'll make a new JIRA issue after clean up the changes.
{quote}To avoid confusion, personally I'd like to proceed things in a right 
order - cleaning up first, then generalizing. But if you are sure that we can 
go in parallel, can you share your plan?
{quote}
Sure. It's an important thing.
 I think we can proceed in parallel.

There are two possible cases.
 1) You finish between JapaneseTokenizer and DictionaryBuilder job first.
 In that case, I can pull your new code and merge with nori's DictionaryBuilder.

2) I finish merging DictionaryBuilder(nori) and DictionaryBuilder(kuromoji) 
first.
 In that case, you can pull and continue.
 The DictionaryBuilder logic of kuromoji does not change at all in my work.

But if you think it is a little inefficient, I'll do later.
 What do you think about it?

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-31 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853504#comment-16853504
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

{quote}But I think BinaryDirectoryWriter case can be solved if we separate 
methods. (+ use DictionaryFormat)
 Can I try this when you are concentrating on JapaneseTokenizer?
{quote}
Please open another issue for this. Any generalization of DictionaryBuilder and 
its auxiliary classes should be treated another issue, if you agree with my 
plan.
{quote} - First, decouple the encoded system dictionary (mecab-ipadic) to a 
separated jar from the kuromoji jar and clean up the dictionary builder tool. 
This is the scope of this issue.
 - Then generalize the dictionary builder tool to make it able to handle Korean 
dictionary (mecab-ko-dic), on the separated issue.
 - Lastly decouple the korean system dictionary to a separated jar from the 
nori jar, maybe on the another issue.{quote}
Just to make things clear, I will focus on JapaneseTokenizer *and* 
DictionaryBuilder in kuromoji module on this issue so the DictionaryBuilder 
(including its auxiliary classes) can be significantly modified here. To avoid 
confusion, personally I'd like to proceed things in a right order - cleaning up 
first, then generalizing. But if you are sure that we can go in parallel, can 
you share your plan?

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-31 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16853207#comment-16853207
 ] 

Namgyu Kim commented on LUCENE-8816:


{quote}So I thought we can share only DictionaryBuilder between Kuromoji and 
Nori module, as I wrote in the previous comment.
{quote}
I agree with your idea. [~tomoko]

I don't think it would be difficult to merge DictionaryBuilder. (except 
BinaryDirectoryWriter)
 But I think BinaryDirectoryWriter case can be solved if we separate methods. 
(+ use DictionaryFormat)
 Can I try this when you are concentrating on JapaneseTokenizer?

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-31 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852706#comment-16852706
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Related to the comment from [~rcmuir],
{quote}This has changed, so I think it makes sense to look at how to really 
support other japanese dictionaries compatible with the apache license. It 
might mean representing some data differently in the worst case because we need 
more bits.
{quote}
I will try to describe my current rough thoughts.

As far as Japanese dictionaries, I think we should restrict supported 
dictionaries to only "[MeCab IPADIC|https://taku910.github.io/mecab/#download]; 
and "[UniDic|https://unidic.ninjal.ac.jp/download#unidic_bccwj];, both are 
distributed under OSS licenses which are compatible with ASF's policy. Although 
there are a few well-known extensions of them or user customized ones already 
used with Kuromoji in the wild, we need not (actually, cannot) to give 
"official" support or strict compatibility policy to the variants.

Nevertheless, it would be good for users to make small adjustments at some 
point of the implementation to allow to build well-known variants of 
mecab-ipadic or unidic. More concretely, for example: "relaxing the maximum 
string length to be allowed as inputs from X to Y when option Z is given, with 
possible performance degredation or unoptimized data representation".

Does that make sense to you? Currently I have no idea about the performance or 
bit level data size tuning which have been already done, so please correct me 
if I missed the points.

Anyway I'd like to start from understanding the exact meaning of the current 
assertions/restrictions in the builder class, and let me discuss with you about 
how we should/can change this, a little later :)

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852595#comment-16852595
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Thank you guys, for your comments and suggestions.

To me (having little knowledge about the code for now), to be honest it looks 
quite difficult to merge the two tokenizers because those seems have been 
greatly evolved respectively. So I thought we can share only DictionaryBuilder 
between Kuromoji and Nori, as I wrote in the previous comment. However if it 
can be done in the future, this will be great unification.

For the time being, I will work for JapaneseTokenizer and welcome 
implementation advices from the viewpoint of possible future integration.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852581#comment-16852581
 ] 

Robert Muir commented on LUCENE-8816:
-

Yeah there is some optimizations specific to Japanese writing systems in the 
way we store the data: for example writing katakana with a single byte, 
hiragana<->katakana transformations, and so on. The kana really must be 
optimized, due to the way it is used in the language and the dictionary, 
otherwise it wastes tons of space.

But there are also some of these assertions about value ranges specific to 
ipadic: not so great. From my perspective, the problem of other dictionaries 
was always a licensing one. We want to at least be able to test any dictionary 
we want to support. This has changed, so I think it makes sense to look at how 
to really support other japanese dictionaries compatible with the apache 
license. It might mean representing some data differently in the worst case 
because we need more bits.

I don't have any clear idea on how to test and package that. Maybe with the 
gradle build it will become easier to build+test two jar files to make it 
easiest on the user to consume? It may not have the proper tools and APIs to 
support the custom dictionary case (yet), but it would give users choices. 

Just trying to help us think about how to bite off small pieces at a time...



> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Christian Moen (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852551#comment-16852551
 ] 

Christian Moen commented on LUCENE-8816:


Separating out the dictionaries is a great idea.

[~rcmuir] made great efforts making the original dictionary tiny and some 
assumptions were made based on the value ranges of the original source data.

To me it sounds like a good idea to keep the Japanese and Korean dictionaries 
separately initially and consider combining them later on when implications of 
such combination is clear.  I agree with [~jim.ferenczi].

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16852084#comment-16852084
 ] 

Namgyu Kim commented on LUCENE-8816:


That's a good plan, [~tomoko]!

About Tokenizer...
I'm not sure that is better to combine KoreanTokenizer and JapaneseTokenizer.

If we want, there is a way.
KoreanTokenizer and JapaneseTokenizer have parts with duplicate code.
We can create an abstract class called MecabTokenizer. (both are mecab based)
MecabTokenizer will have duplicate things. (like isPunctuation(), 
WrappedPositionArray inner-class, ...)
What do you think about it?

Anyway, combining KoreanTokenizer and JapaneseTokenizer will be a big task and 
I think it's right to create a new issue.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851574#comment-16851574
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

This sounds like a great plan [~tomoko]. Decoupling the system dictionary 
should help for the merge of the Korean tokenizer but I agree that this merge 
is out of scope for this issue.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-30 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851556#comment-16851556
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

I looked over the diffs between current Kuromoji and Nori code, to examine we 
are able to integrate those.

There are many overlaps (copied lines) but also a lot differences which cannot 
be merged easily. (I've not studied about the details, but there may have been 
Korean dictionary/tokenizer specific modifications?)

I've heard MeCab and MeCab IPADIC was designed to be language independent, but 
it seems things are not as easy as one sees. :-)

I think the generalization/integration (if it's possible) should be treated in 
different issues. I'd like to propose following:
 - Keep JapaneseTokenizer and KoreanTokenizer as individual tokenizers (as is). 
Merging them seems not to be feasible to me, but others might have good 
solutions or ideas for it.
 - First, decouple the encoded system dictionary (mecab-ipadic) to a separated 
jar from the kuromoji jar and clean up the dictionary builder tool. This is the 
scope of this issue.
 - Then generalize the dictionary builder tool to make it able to handle Korean 
dictionary (mecab-ko-dic), on the separated issue.
 - Lastly decouple the korean system dictionary to a separated jar from the 
nori jar, maybe on the another issue.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-29 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851477#comment-16851477
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Thaks [~rcmuir],

I checked current assertions in the build tool. It seems there are mecab-ipadic 
specific constraints and more general ones which have to be checked for both of 
mecab-ipadic and unidic. The former should be relaxed to support mecab-ipadic 
extensions or UniDic format, and latter one would be better to be modified to 
exceptions as suggested.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-29 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16851103#comment-16851103
 ] 

Namgyu Kim commented on LUCENE-8816:


Thanks for the answers :D

Oh, I think your opinion is good, [~tomoko].
It is important to ensure that users do not modify the source code.
If we apply that approach, it seems to be able to create a build tool that 
supports two pre-types in the CLI environment.
And it's very important to be good at handling exceptions as [~rcmuir] said.
Otherwise, users will think that stability is poor.

About version control, I totally agree with you.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-29 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850954#comment-16850954
 ] 

Robert Muir commented on LUCENE-8816:
-

As a suggestion for the first concrete step: the various {{assert}}s in the 
current tooling should be fixed to be real {{if}} statements+exception 
messages. I think the current use of asserts was just laziness on my part that 
should be fixed if we want to try to make the tooling more usable and robust.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16850233#comment-16850233
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Hi [~danmuzi],

thanks for speaking up about this.

The dictionary builder tool should be included in Lucene binary distribution 
tar/zip so that developers will not need to download Lucene source code. I 
imagine it will be an ordinary command-line tool that runs on JDK with options 
to specify the dictionary format ("mecab-ipadic" or "unidic") and raw 
dictionary data location (something like {{-path}} or {{-url}}).

About version control, I believe that dictionary data encoding should be 
compatible at least within same Lucene major version (like Codecs) so that 
users do not need to re-build the dictionary at every Lucene/Solr or 
Elasticsearch minor version up. Ideally the dictionary version is independent 
from Lucene version, but the encoding could be changed for some reason (e.g. 
performance improvement) in the future. Anyway I think it happens not so often.

 

 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849967#comment-16849967
 ] 

Namgyu Kim commented on LUCENE-8816:


Hi everybody,
 Thank you for opening the issue! [~tomoko]

To be honest, at first, when I talked about a custom system dictionary, I did 
not see a big sight.

Anyway, the structure I think is as follows.

 

1. As Tomoko said, make developer-oriented dictionary build tool
 The "ant regenerate" command inside the build.xml that I checked has the 
following steps.
  1) Compile the code (compile-tools)
  2) Download the jar file (download-dict)
  3) Save Noun.proper.csv diffs (patch-dict)
  4) Run DictionaryBuilder and make dat files (build-dict)

It does not matter if user builds only system dictionary. (Of course there is a 
problem to modify the classpath)
 (ex) ant build-dict ipadic /home/my/path/customDicIn(custom-dic input) 
/home/my/path/customDicOutput(dat output) utf-8 false

However, if user needs to get a dictionary from the server, they should modify 
the build.xml.
 As I know, the url path is hard-coded.
 Of course, the user can run by modifying ivy.xml and build.xml.
 But my personal opinion is user should not touch Lucene's internal code. (even 
build script)
 Maybe the user is afraid to change or feel reluctant to use it. (Especially 
who have not used Apache Ant)
 However, I think that this may be different for every person.

 

2. Version Control
 I actually think this is the biggest problem.
 As I mentioned in the email,
 if the Lucene version goes up, users have to rebuild their system dictionary 
unconditionally and put it in the jar.
 Because the current process is,
  1) Process like 1.
  2) move users system directory dat files to 
resources/org.apache.lucene.analysis.ja.dict
  3) ant jar

Because of the 3), the user always has to rebuild kuromoji module or fix the 
kuromoji jar.
 The users can feel irritated, when there is no kuromoji module change in the 
version up.

This problem can be solved easily if the system dictionary can only be 
parameterized in JapaneseTokenizer.
 (Of course, the expert javadoc is required)

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849840#comment-16849840
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

Hi [~jim.ferenczi],

I just skimmed your comments and will try to answer your questions. Correct me 
if I missed your points.
{quote}This could be a nice cleanup if the goal is to handle multiple mecab 
dictionaries (in different languages).
{quote}
I didn't think that but of course it seems good unification to me.
  
{quote}While allowing more flexibility would be nice I wonder if there are that 
many different dictionaries.
{quote}
I my view, there are only two dictionary formats we should support (MeCab 
IPADIC and UniDic). There are some other old dictionaries - NAIST-jdic or 
ChaSen ipadic - but they are completely obsolete now and rarely used (as far as 
I know). 
 There are several well-known extensions of mecab-ipadic and unidic, so we can 
support almost all common variants (in Japan) by supporting those.
{quote}If the ipadic is obsolete we could also adapt the main distribution 
(kuromoji) to use the UniDic instead.
{quote}
Yes, I think so. But I am not sure that we should select UniDic as default 
immediately. While users often complain about MeCab IPADIC, it is still 
high-quality and widely accepted dictionary. And even when we change the 
default dictionary to UniDic (I think it is definitely OK after giving some 
time to users for migrating/testing their applications), we have to provide the 
option to use old MeCab IPADIC for users who trust it and do not need 
"contemporary words".
{quote}Even if we handle multiple dictionaries we'll still need to provide a 
way for users to add custom entries. Mecab has an option to compute the leftId, 
rightId and cost automatically from a partial user entry so I wonder if this 
could help to avoid users to reimplement a dictionary from scratch ?
{quote}
Yes, custom entries (user dictionaries) are needed option to customize 
tokenization behaviour, for uses with little NLP skill :)

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Jim Ferenczi (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849778#comment-16849778
 ] 

Jim Ferenczi commented on LUCENE-8816:
--

We discussed this when we added the Korean module and said that we could have a 
separate module to handle "mecab-like" tokenization and one module per 
dictionary (ipadic, mecab-ko-dic, ...). There are some assertions in the 
JapaneseTokenizer that checks some invariant of the ipadic (leftId == rightId 
for instance) but I guess we could move them in the dictionary module. This 
could be a nice cleanup if the goal is to handle multiple mecab dictionaries 
(in different languages).

 

{quote}

While it has been slowly obsoleted, well-maintained and/or extended 
dictionaries risen up in recent years (e.g. 
[mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
[UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
attempts/projects/efforts are made in Japan.

{quote}

 

While allowing more flexibility would be nice I wonder if there are that many 
different dictionaries. If the ipadic is obsolete we could also adapt the main 
distribution (kuromoji) to use the UniDic instead. Even if we handle multiple 
dictionaries we'll still need to provide a way for users to add custom entries. 
Mecab has an option to compute the leftId, rightId and cost automatically from 
a partial user entry so I wonder if this could help to avoid users to 
reimplement a dictionary from scratch ?

 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849755#comment-16849755
 ] 

Tomoko Uchida commented on LUCENE-8816:
---

[~rcmuir] [~sokolov]: Thanks for the comments and advice!

bq. What if we changed the various dictionary classes to load-on-demand from a 
configurable classpath-directory, rather than from a single built-in one?

About this point, I thought it would be better the each encoded dictionary has 
its "name" and users specify the dictionary name to be used at some point of 
initialization (for safety reasons because there could be multiple re-trained 
models), rather than completely auto discovering something like SLF4J does so. 
Just an idea.

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849734#comment-16849734
 ] 

Robert Muir commented on LUCENE-8816:
-

>From the perspective of this code, the dictionaries are basically in different 
>formats. Does it make sense? It is more complicated than just "different words 
>and costs".

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Mike Sokolov (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849731#comment-16849731
 ] 

Mike Sokolov commented on LUCENE-8816:
--

What if we changed the various dictionary classes to load-on-demand from a 
configurable classpath-directory, rather than from a single built-in one? If we 
do that, then users can supply a separate jar containing only the model files, 
and reference it when initializing the JapaneseAnalyzer. 

> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8816) Decouple Kuromoji's morphological analyser and its dictionary

2019-05-28 Thread Robert Muir (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16849725#comment-16849725
 ] 

Robert Muir commented on LUCENE-8816:
-

if the licensing challenge in LUCENE-4056 is overcome, then it just means only 
technical hurdles :) 

it is possible to now "support" different dictionaries (maybe by storing some 
data differently depending on dictionary) but I think its important to consider 
how tests (build system) would work. Some of the data structures etc are 
complicated, so you really need the unit test suite to know things are working 
correctly. The current flow (changing build.xml) does support this for advanced 
users.

I am concerned about a commandline tool because, without testing, the stuff it 
generates may fail unexpectedly at runtime due to bugs or assumptions...


> Decouple Kuromoji's morphological analyser and its dictionary
> -
>
> Key: LUCENE-8816
> URL: https://issues.apache.org/jira/browse/LUCENE-8816
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/analysis
>Reporter: Tomoko Uchida
>Priority: Major
>
> I've inspired by this mail-list thread.
>  
> [http://mail-archives.apache.org/mod_mbox/lucene-java-user/201905.mbox/%3CCAGUSZHA3U_vWpRfxQb4jttT7sAOu%2BuaU8MfvXSYgNP9s9JNsXw%40mail.gmail.com%3E]
> As many Japanese already know, default built-in dictionary bundled with 
> Kuromoji (MeCab IPADIC) is a bit old and no longer maintained for many years. 
> While it has been slowly obsoleted, well-maintained and/or extended 
> dictionaries risen up in recent years (e.g. 
> [mecab-ipadic-neologd|https://github.com/neologd/mecab-ipadic-neologd], 
> [UniDic|https://unidic.ninjal.ac.jp/]). To use them with Kuromoji, some 
> attempts/projects/efforts are made in Japan.
> However current architecture - dictionary bundled jar - is essentially 
> incompatible with the idea "switch the system dictionary", and developers 
> have difficulties to do so.
> Traditionally, the morphological analysis engine (viterbi logic) and the 
> encoded dictionary (language model) had been decoupled (like MeCab, the 
> origin of Kuromoji, or lucene-gosen). So actually decoupling them is a 
> natural idea, and I feel that it's good time to re-think the current 
> architecture.
> Also this would be good for advanced users who have customized/re-trained 
> their own system dictionary.
> Goals of this issue:
>  * Decouple JapaneseTokenizer itself and encoded system dictionary.
>  * Implement dynamic dictionary load mechanism.
>  * Provide developer-oriented dictionary build tool.
> Non-goals:
>   * Provide learner or language model (it's up to users and should be outside 
> the scope).
> I have not dove into the code yet, so have no idea about it's easy or 
> difficult at this moment.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org