[jira] [Commented] (LUCENE-8869) Build kuromoji system dictionary as a separated jar and load it from JapaneseTokenizer at runtime

Tomoko Uchida (JIRA) Sun, 23 Jun 2019 03:03:15 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870515#comment-16870515
 ]


Tomoko Uchida commented on LUCENE-8869:
---------------------------------------

As a first step, I moved the dictionary data (dat files) to a separate jar on 
my local branch.
https://github.com/mocobeta/lucene-solr-mirror/commit/9def2b22f4e7467bef72edfac84c9f74f67289aa

In order to build and ship two jars (one for kuromoji analyzer, one for the 
system dictionary), I slightly changed the directory structure:

{code}
analysis/kuromoji/
├── build.xml
├── ivy.xml
├── src
│     ├── java
│     │     ├── org
│     │     └── overview.html
│     ├── resources
│     │     ├── META-INF
│     │     └── org
│     ├── test
│     │     └── org
│     └── tools
│           ├── java
│           ├── patches
│           └── test
└── sysdic
        └── src
              └── resources
{code}

Here, a {{sysdic}} directory is added, and the {{build-dict}} task now places 
all dat files in {{sysdic/src/resources}} instead of {{src/resources}}.

On the JapaneseTokenizer side, all dictionary data is currently held in 
static singleton fields. We need to make it possible to load the dictionary 
data flexibly from a jar or a directory path (for testing purposes) when 
initializing a tokenizer, so that users can choose an arbitrary dictionary at 
runtime.
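
To make the idea concrete, here is a minimal sketch of loading dictionary 
files from either the classpath (for example, a separate dictionary jar) or a 
plain directory. The names {{DictionarySource}} and {{DictionaryHolder}} are 
hypothetical and exist only for illustration; they are not Lucene APIs and not 
necessarily what the patch will look like. The point is that the source is 
passed in per instance rather than fixed in static fields.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical abstraction (not a Lucene API): where dictionary files come from. */
interface DictionarySource {
  /** Opens the named dictionary file, e.g. one of the dat files. */
  InputStream open(String fileName) throws IOException;

  /** Reads files from the classpath, e.g. from a separate dictionary jar. */
  static DictionarySource fromClasspath(ClassLoader loader, String basePath) {
    return fileName -> {
      String resource = basePath + "/" + fileName;
      InputStream in = loader.getResourceAsStream(resource);
      if (in == null) {
        throw new IOException("dictionary resource not found: " + resource);
      }
      return in;
    };
  }

  /** Reads files from a plain directory, which is convenient for tests. */
  static DictionarySource fromDirectory(Path dir) {
    return fileName -> Files.newInputStream(dir.resolve(fileName));
  }
}

/** Hypothetical holder: the source is injected per instance instead of being static. */
final class DictionaryHolder {
  private final DictionarySource source;

  DictionaryHolder(DictionarySource source) {
    this.source = source;
  }

  /** Loads the raw bytes of one dictionary file (readAllBytes needs Java 9+). */
  byte[] load(String fileName) throws IOException {
    try (InputStream in = source.open(fileName)) {
      return in.readAllBytes();
    }
  }
}
```

With something like this, a test could pass {{DictionarySource.fromDirectory(...)}} 
pointing at a small fixture directory, while the default setup would use 
{{fromClasspath(...)}} against the dictionary jar.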

> Build kuromoji system dictionary as a separated jar and load it from 
> JapaneseTokenizer at runtime
> -------------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-8869
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8869
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/analysis
>            Reporter: Tomoko Uchida
>            Priority: Major
>
> This is a sub-task for LUCENE-8816.
>  In this issue, I will try to make small but self-contained changes to 
> kuromoji system dictionary.
>  - Make it possible to build a jar that contains (maybe) only dictionary data 
> resource generated by the {{build-dict}} task.
>  -- Maybe a new ant target will be added.
>  - Make it possible to load external dictionary when initializing 
> JapaneseTokenizer.
>  -- Some work is already done on LUCENE-8863
>  - Decouple current system dictionary data (mecab ipadic) from kuromoji 
> itself and use it as default (Possibly it can be done with another issue).
> Also, some refactoring of the directory/source tree structure may be needed.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

