[
https://issues.apache.org/jira/browse/LUCENE-8869?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16870515#comment-16870515
]
Tomoko Uchida commented on LUCENE-8869:
---------------------------------------
As a first step, I moved the dictionary data (dat files) to a separate jar on my
local branch.
https://github.com/mocobeta/lucene-solr-mirror/commit/9def2b22f4e7467bef72edfac84c9f74f67289aa
In order to build and ship two jars (one for kuromoji analyzer, one for the
system dictionary), I slightly changed the directory structure:
{code}
analysis/kuromoji/
├── build.xml
├── ivy.xml
├── src
│   ├── java
│   │   ├── org
│   │   └── overview.html
│   ├── resources
│   │   ├── META-INF
│   │   └── org
│   ├── test
│   │   └── org
│   └── tools
│       ├── java
│       ├── patches
│       └── test
└── sysdic
    └── src
        └── resources
{code}
Here, a {{sysdic}} directory is added, and the {{build-dict}} task now places
all dat files in {{sysdic/src/resources}} instead of {{src/resources}}.
On the JapaneseTokenizer side, all dictionary data is currently held in
static singleton fields. We need to make it possible to load the dictionary
data flexibly from a jar or from a directory path (for testing purposes) when
initializing a tokenizer, so that users can choose an arbitrary dictionary at
runtime.
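To illustrate the loading side, here is a minimal sketch of a resource
locator that reads a dat file either from a directory path or from the
classpath (i.e. from a jar). The class name {{DictionaryResourceLoader}} and
its methods are hypothetical and only show the idea; this is not the code on
my branch nor an existing Lucene API.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Objects;

/**
 * Hypothetical helper (not part of Lucene): opens dictionary resources either
 * from a directory on disk or from the classpath, e.g. a separate dictionary jar.
 */
public final class DictionaryResourceLoader {
  private final Path directory;     // non-null when reading from a directory
  private final ClassLoader loader; // non-null when reading from the classpath

  private DictionaryResourceLoader(Path directory, ClassLoader loader) {
    this.directory = directory;
    this.loader = loader;
  }

  /** Reads resources from a directory, which is convenient for tests. */
  public static DictionaryResourceLoader fromDirectory(Path directory) {
    return new DictionaryResourceLoader(Objects.requireNonNull(directory), null);
  }

  /** Reads resources through a class loader, e.g. one that sees the dictionary jar. */
  public static DictionaryResourceLoader fromClasspath(ClassLoader loader) {
    return new DictionaryResourceLoader(null, Objects.requireNonNull(loader));
  }

  /** Opens the named resource; the caller is responsible for closing the stream. */
  public InputStream open(String name) throws IOException {
    if (directory != null) {
      return Files.newInputStream(directory.resolve(name));
    }
    InputStream in = loader.getResourceAsStream(name);
    if (in == null) {
      throw new FileNotFoundException("Dictionary resource not found on classpath: " + name);
    }
    return in;
  }
}
```

With something like this, a tokenizer could take the loader as a constructor
argument instead of reading from static singleton fields, and tests could
point it at a temporary directory.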
> Build kuromoji system dictionary as a separated jar and load it from
> JapaneseTokenizer at runtime
> -------------------------------------------------------------------------------------------------
>
> Key: LUCENE-8869
> URL: https://issues.apache.org/jira/browse/LUCENE-8869
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Tomoko Uchida
> Priority: Major
>
> This is a sub-task for LUCENE-8816.
> In this issue, I will try to make small but self-contained changes to
> kuromoji system dictionary.
> - Make it possible to build a jar that contains (maybe) only the dictionary data
> resources generated by the {{build-dict}} task.
> -- Maybe a new ant target will be added.
> - Make it possible to load an external dictionary when initializing
> JapaneseTokenizer.
> -- Some work has already been done in LUCENE-8863.
> - Decouple the current system dictionary data (mecab ipadic) from kuromoji
> itself and use it as the default (possibly in a separate issue).
> Also, some refactoring of the directory/source tree structure may be needed.