[jira] [Comment Edited] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-08 Thread Tomoko Uchida (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16859344#comment-16859344
 ] 

Tomoko Uchida edited comment on LUCENE-8817 at 6/9/19 2:45 AM:
---

To me, creating an {{analysis/mecab}} directory and placing the 
{{mecab-tools}} module (option 4) under it looks like a good starting point.

We are already considering further integration of kuromoji and nori (at 
LUCENE-8816 and LUCENE-8812), and I suppose it will happen sooner or later. So 
how about sketching an overall design now? For example:
{code:java}
analysis
└── mecab
    ├── common (module: analyzers-mecab-common)
    │   ├── build.xml
    │   └── src
    ├── kuromoji (module: analyzers-mecab-kuromoji)
    │   ├── build.xml
    │   └── src
    ├── nori (module: analyzers-mecab-nori)
    │   ├── build.xml
    │   └── src
    └── tools (module: analyzers-mecab-tools)
        ├── build.xml
        └── src
{code}
For this issue, only the "mecab-tools" module would be added (and the current 
kuromoji and nori modules would gain a dependency on it).
This is just an idea, and I am not an expert on the shadow Maven POMs. 
[~rcmuir], [~jim.ferenczi] and [~cm] may have different thoughts.

As for option 2, I don't think it is a good idea to change the current 
structure of other modules (analysis-common or icu). Opinions?
{quote}I will go ahead if direction is set, but landing will be delayed a 
little.
 The reason is that the build system is going to change. (SOLR-13452)
 But if it does not matter, I will proceed.
{quote}
We need to take care of the ongoing change in the build infrastructure, but I 
don't think it is a big enough concern to stop the work here (or on 
LUCENE-8816) :) After the commits are pushed to master, you would be able to 
backport the changes to the gradle branch (I think Mark Miller and others will 
give you help or advice with that work).


> Combine Nori and Kuromoji DictionaryBuilder
> ---
>
> Key: LUCENE-8817
> URL: https://issues.apache.org/jira/browse/LUCENE-8817
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Namgyu Kim
> Priority: Major
>
> This issue is related to LUCENE-8816.
> Currently the Nori and Kuromoji analyzers use the same dictionary structure 
> (MeCab).
> If we combine their DictionaryBuilder implementations, we can reduce the 
> code size.
> However, parts of this task are language-dependent 
> (such as the HEADER string in BinaryDictionary and CharacterDefinition, 
> methods in BinaryDictionaryWriter, ...).
> On the other hand, there are many overlapping classes.
> The purpose of this patch is to provide users of Nori and Kuromoji with the 
> same system dictionary generator.
> It may take some time because there is some work involved.
> The work will be based on the latest master, and if LUCENE-8816 is 
> finished first, I will pull the latest code and proceed.
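
As a rough illustration of the idea in the description (not the actual patch), a shared writer could take the language-dependent parts, such as the header string, as parameters. Every name below ({{SharedDictionarySketch}}, {{DictionaryFormat}}, the example header values and the byte layout) is hypothetical and does not correspond to existing Lucene classes or to the real dictionary format.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch: none of these types or layouts exist in Lucene.
final class SharedDictionarySketch {

  /** Language-dependent settings that the shared writer needs. */
  static final class DictionaryFormat {
    final String header; // assumed per-language header string
    final int version;   // assumed format version

    DictionaryFormat(String header, int version) {
      this.header = header;
      this.version = version;
    }
  }

  /** Writes header, version, entry count and entries in one common layout. */
  static byte[] write(DictionaryFormat format, String[] surfaceForms) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeUTF(format.header);
      out.writeInt(format.version);
      out.writeInt(surfaceForms.length);
      for (String form : surfaceForms) {
        out.writeUTF(form);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  /** Reads back only the header, e.g. to check which language a file is for. */
  static String readHeader(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      return in.readUTF();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // The header values are made up for this example.
    DictionaryFormat japanese = new DictionaryFormat("example_ja_dict", 1);
    DictionaryFormat korean = new DictionaryFormat("example_ko_dict", 1);
    byte[] ja = write(japanese, new String[] {"東京"});
    byte[] ko = write(korean, new String[] {"서울"});
    System.out.println(readHeader(ja) + " / " + readHeader(ko));
  }
}
```

A real unification would also have to cover CharacterDefinition and the writer methods mentioned in the description; the sketch only shows one place where a language-specific value could be injected into otherwise shared code.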



[jira] [Comment Edited] (LUCENE-8817) Combine Nori and Kuromoji DictionaryBuilder

2019-06-07 Thread Namgyu Kim (JIRA)


[ 
https://issues.apache.org/jira/browse/LUCENE-8817?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16858895#comment-16858895
 ] 

Namgyu Kim edited comment on LUCENE-8817 at 6/7/19 6:40 PM:


Here is the current status.
The merge is almost done, and I need some discussion.

 

I have considered several structures.

1. Put it in the tools of the analysis-common module.
 This is simple, but I think it is hard to regard MeCab as a feature of 
analysis-common.

2. Create a tools folder in analysis and put the mecab-tools module there.
 analysis/tools
   ├ analysis-common-tools (to-be)
   ├ icu-tools (to-be)
   ├ mecab-tools
   └ ...
 The problem with this is that the number of modules grows a lot, because 
each tool is created as its own module.

3. Create a module called mecab.
 We can create a mecab module that is the starting point for merging nori and 
kuromoji.
 If we proceed in this direction, src will contain only the tools at first.
 However, with this approach it may not be easy to create the runnable jar, 
because the module will also include library code 
(e.g. MecabAnalyzer, MecabTokenizer, ...).

4. Create a module called mecab-tools.
 It is easy to develop, but the other modules in analysis are libraries, 
so it seems odd to have one that is only a runnable jar.

 

Number 2 seems to be the best, but I'm not sure yet.
 I would appreciate any comments.

 

I will go ahead if direction is set, but landing will be delayed a little.
 The reason is that the build system is going to change. (SOLR-13452)
 But if it does not matter, I will proceed.


--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org