I'd like to make the JapaneseTokenizer a little more flexible by allowing
Solr users to supply their own dictionary via the JapaneseTokenizerFactory.
We looked into using the existing User Dictionary functionality, but it
didn't suit our use case. We compiled our own dictionary and had to resort
to a bit of a kludge to get the JapaneseTokenizer to recognize it.

Here's what I'd like to do.

* Publish the Kuromoji tools as a JAR to Maven
* Refactor the various Dictionary classes so they can optionally load a
dictionary from the file system, rather than always loading it from a
hard-coded location on the classpath. If no file system location is
provided, fall back to the default dictionary on the classpath.
* Move instantiation of the dictionary to the JapaneseTokenizerFactory and
pass the dictionary in as a parameter to the JapaneseTokenizer constructor
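
A minimal sketch of the fallback described in the second bullet might look
like the following. The class names (LoadedDictionary, DictionaryLoaderSketch)
and the resource name are hypothetical and only illustrate the idea; they are
not the actual Lucene Dictionary classes or file names.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical holder for raw dictionary bytes; not an actual Lucene class. */
final class LoadedDictionary {
    final byte[] data;
    final String source;

    LoadedDictionary(byte[] data, String source) {
        this.data = data;
        this.source = source;
    }
}

/** Hypothetical loader: use the file system if a path is given, else the classpath default. */
final class DictionaryLoaderSketch {
    // Assumed resource name, for illustration only.
    static final String DEFAULT_RESOURCE = "/default-dictionary.dat";

    static LoadedDictionary load(Path externalPath) {
        try {
            if (externalPath != null) {
                // A location was configured: read the dictionary from the file system.
                return new LoadedDictionary(Files.readAllBytes(externalPath), "file:" + externalPath);
            }
            // No location configured: fall back to the dictionary bundled on the classpath.
            try (InputStream in = DictionaryLoaderSketch.class.getResourceAsStream(DEFAULT_RESOURCE)) {
                if (in == null) {
                    throw new IOException("default dictionary not on classpath: " + DEFAULT_RESOURCE);
                }
                return new LoadedDictionary(in.readAllBytes(), "classpath:" + DEFAULT_RESOURCE);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Passing null keeps today's behavior (classpath only), so existing
configurations would not need to change under this approach.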

I've looked into the code, and this seems like a manageable change, but I
want to make sure I'm not breaking anything.

Currently the Dictionary classes maintain their own singleton instances in
static variables. It seems to me that it might be better if the
JapaneseTokenizerFactory held on to an instance of the dictionary and
passed it in to every JapaneseTokenizer it creates. Was there a reason for
using a singleton pattern for the various Dictionary classes, or can this
be changed?
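
To illustrate the ownership change, here is a rough sketch in which the
factory holds the dictionary and hands it to each tokenizer through the
constructor. All three classes are hypothetical stand-ins, not the real
JapaneseTokenizerFactory, JapaneseTokenizer, or Dictionary classes.

```java
import java.util.Map;

/** Hypothetical dictionary type; stands in for the real Dictionary classes. */
final class SharedDictionary {
    private final Map<String, String> entries;

    SharedDictionary(Map<String, String> entries) {
        this.entries = Map.copyOf(entries);
    }

    String lookup(String surface) {
        return entries.get(surface);
    }
}

/** Hypothetical tokenizer that receives its dictionary instead of reading a static singleton. */
final class TokenizerSketch {
    private final SharedDictionary dictionary;

    TokenizerSketch(SharedDictionary dictionary) {
        this.dictionary = dictionary;
    }

    SharedDictionary dictionary() {
        return dictionary;
    }
}

/** Hypothetical factory that owns one dictionary and shares it with every tokenizer it creates. */
final class TokenizerFactorySketch {
    private final SharedDictionary dictionary;

    TokenizerFactorySketch(SharedDictionary dictionary) {
        this.dictionary = dictionary;
    }

    TokenizerSketch create() {
        // Every tokenizer from this factory sees the same dictionary instance.
        return new TokenizerSketch(dictionary);
    }
}
```

With this shape, two factories configured with different dictionaries could
coexist in the same JVM, which a static singleton does not allow.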

Is there any objection to publishing the Kuromoji tools to Maven? They were
very easy to compile and use. Packaging them up as a JAR file was simple,
but I will need a bit of direction as to how to do this within the current
conventions for Lucene's build.xml files.

- Hayden
