Fix for Japanese SEN morphological analyzer, and moving into Contrib

Mark Bennett Mon, 12 Oct 2009 10:47:07 -0700

Hi folks,

I've been working to fix the Japanese SEN morphological analyzer, which is
currently hosted at:
https://sen.dev.java.net


To review, Japanese doesn't use whitespace for word breaks.  The traditional
approach to CJK (Chinese, Japanese, Korean) is to use bigram character pairs
in the index.  While this works to a point, some believe that using proper
word breaks provides better results.
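
For anyone not familiar with the bigram approach, here is a minimal sketch
in plain Java.  It uses no Lucene classes, the class and method names are
made up for illustration, and it is not the code of any existing analyzer;
it only shows what "overlapping character pairs" means.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: not part of Lucene or SEN.
public class BigramSketch {

    /** Returns the overlapping two-character tokens for a run of CJK text. */
    static List<String> bigrams(String text) {
        // Work on code points so surrogate pairs are not split in half.
        int[] cps = text.codePoints().toArray();
        List<String> out = new ArrayList<>();
        if (cps.length == 1) {
            // A lone character has no pair; emit it as a single token.
            out.add(text);
            return out;
        }
        for (int i = 0; i + 1 < cps.length; i++) {
            out.add(new String(cps, i, 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [東京, 京都, 都に, に住, 住む]
        System.out.println(bigrams("東京都に住む"));
    }
}
```

Note that the text about 東京都 (Tokyo Metropolis) produces the token 京都
(Kyoto), so a search for Kyoto would match it.  That kind of false match is
what a morphological analyzer with proper word breaks is meant to avoid.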

The "lucene-ja" glue layer between Lucene and the core SEN library broke in
May of '09 when a fix was made in Lucene:
http://issues.apache.org/jira/browse/LUCENE-1636

Uwe S. had a very good insight for a quick fix, and I have been cleaning up
some other issues with the code.  I have also spoken with the author, Takashi
Okamoto, and he is fine with having this moved from java.net to ASF; I think
it will be easier for folks to find and use it if it's in ASF.

I'm not quite ready to submit a patch, but the Wiki suggests emailing the
list with the idea in advance.  I'll have some packaging questions, since
there are actually quite a few parts.  Also, the wiki didn't quite spell
out the process for getting things into contrib, beyond emailing and
submitting a patch.  I also plan to eventually submit a Solr-specific wrapper
to the solr dev list, to allow dynamic config changes to be made from Solr's
schema.  But since the original code was Lucene based, and Lucene provides
the broadest reach, I think having it in core Lucene would be a good start.

Any comments, suggestions, or mentor volunteers?  :-)

Mark

--
Mark Bennett / New Idea Engineering, Inc. / [email protected]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
