Hi folks,

I've been working to fix the Japanese SEN morphological analyzer, which is currently hosted at: https://sen.dev.java.net

To review, Japanese doesn't use whitespace for word breaks. The traditional approach to CJK (Chinese, Japanese, Korean) text is to index bigram character pairs. While this works to a point, some believe that using proper word breaks provides better results.

The "lucene-ja" glue layer between Lucene and the core SEN library broke in May of '09 when a fix was made in Lucene: http://issues.apache.org/jira/browse/LUCENE-1636

Uwe S. had a very good insight for a quick fix, and I have been cleaning up some other issues with the code. I have also spoken with the author, Takashi Okamoto, and he is fine with having this moved from java.net to the ASF; I think it will be easier for folks to find and use if it's at the ASF.

I'm not quite ready to submit a patch, but the wiki suggests emailing the list with the idea in advance. I'll have some packaging questions, since there are actually quite a few parts. Also, the wiki didn't quite spell out the process for getting things into contrib, beyond emailing and submitting a patch.

I also plan to eventually submit a Solr-specific wrapper to the Solr dev list, to allow dynamic config changes to be made from Solr's schema. But since the original code was Lucene-based, and Lucene provides the broadest reach, I think having it in core Lucene would be a good start.

Any comments, suggestions, or mentor volunteers? :-)

Mark

--
Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513