I'd like to make the JapaneseTokenizer a little more flexible by allowing Solr users to supply their own dictionary via the JapaneseTokenizerFactory. We looked into using the existing User Dictionary functionality, but it didn't suit our use case. We compiled our own dictionary, and had to do a bit of a kludge to get the JapaneseTokenizer to recognize it.
Here's what I'd like to do.

* Publish the Kuromoji tools as a JAR to Maven.
* Refactor the various Dictionary classes to optionally allow loading a dictionary from the file system instead of hard-coding loading the dictionary from the classpath. If no file system location is provided, fall back to the default dictionary on the classpath.
* Move instantiation of the dictionary to the JapaneseTokenizerFactory and pass the dictionary in as a parameter to the JapaneseTokenizer constructor.

I've looked into the code, and this seems like a manageable change, but I want to make sure I'm not breaking anything.

Currently the Dictionary classes maintain their own singleton instances in static variables. It seems to me that it might be better if the JapaneseTokenizerFactory were to hold on to an instance of the dictionary and pass this in to any JapaneseTokenizer created. Was there a reason for using a singleton pattern for the various Dictionary classes, or can this be changed?

Is there any objection to publishing the Kuromoji tools to Maven? They were very easy to compile and use. Packaging them up as a JAR file was simple, but I will need a bit of direction as to how to do this within the current conventions for Lucene's build.xml files.

- Hayden
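
P.S. To make the proposal a bit more concrete, here is a rough sketch of the shape I have in mind. It is plain Java with no Lucene dependency, and every name in it (DictionarySketch, Dict, load, Tokenizer, TokenizerFactory) is a hypothetical stand-in rather than the actual Kuromoji classes or their signatures. It only illustrates two ideas: load from the file system when a path is given and otherwise fall back to a classpath resource, and let the factory own the dictionary instance instead of a static singleton.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class DictionarySketch {

    /** Hypothetical stand-in for a compiled dictionary; here it just holds raw bytes. */
    static final class Dict {
        final String source; // where the data came from, for illustration only
        final byte[] data;

        Dict(String source, byte[] data) {
            this.source = source;
            this.data = data;
        }
    }

    /**
     * Loads from the file system if fsPath is non-null; otherwise falls back
     * to the given classpath resource (the "default dictionary" case).
     */
    static Dict load(Path fsPath, String classpathResource) throws IOException {
        if (fsPath != null) {
            return new Dict("file:" + fsPath, Files.readAllBytes(fsPath));
        }
        try (InputStream in = DictionarySketch.class.getResourceAsStream(classpathResource)) {
            if (in == null) {
                throw new IOException("Dictionary not found on classpath: " + classpathResource);
            }
            return new Dict("classpath:" + classpathResource, in.readAllBytes());
        }
    }

    /** Hypothetical tokenizer that receives its dictionary via the constructor. */
    static final class Tokenizer {
        final Dict dict;

        Tokenizer(Dict dict) {
            this.dict = dict;
        }
    }

    /** Hypothetical factory that holds one dictionary instance and shares it. */
    static final class TokenizerFactory {
        private final Dict dict;

        TokenizerFactory(Dict dict) {
            this.dict = dict;
        }

        Tokenizer create() {
            // Every tokenizer created by this factory shares the same dictionary,
            // without any static singleton involved.
            return new Tokenizer(dict);
        }
    }
}
```

With something like this, two factories configured with different dictionary paths could coexist in the same JVM, which the static singletons don't allow today. Whether that is worth the extra memory for the default case is part of what I'm asking about.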