Mark, I agree with what you said; it would be great if there were a way to easily enable this Japanese support.
I will let someone else comment on the licensing, but since you mentioned source dictionaries: I thought Sen only used IPA dic for its data? I could be wrong on this. I think it's under a BSD-like license. You can read the license as Google Chrome prints it here (in a separate, really interesting dictionary for CJ segmentation):
http://src.chromium.org/viewvc/chrome/trunk/deps/third_party/icu38/source/data/brkitr/cjdict.txt

On Mon, Oct 12, 2009 at 2:58 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:

> Hello Robert,
>
> That's a good question. The core SEN is under LGPL, yes. However, I
> didn't need to make changes to that, though given that there are 2 versions
> floating around, I think it needs a good home.
>
> But the glue-layer is under "Apache 2.0" license, and that's the part that
> needed fixing. I think that means it's ASF / contrib compatible?
>
> There are also 2 other ancillary libraries and some source dictionaries
> which I need to research.
>
> Working from the other direction, which might give you some ideas:
> The goal is to get this more accessible. It'd be really nice if, in a
> Lucene distribution, the SEN library could be switched on as easily as CJK.
> Or at the most you'd run an ant script to fetch all the parts and assemble
> it. As it stands now I think it's not used much because it's a bit complex
> to set up, even prior to May '09's change, and most of the users of it
> discuss it in Japanese. So that's the goal, I'm very open to ideas on the
> "how".
>
> Mark
>
> --
> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>
>
> On Mon, Oct 12, 2009 at 11:10 AM, Robert Muir <rcm...@gmail.com> wrote:
>
>> Mark, does this mean Sen will be under the Apache license? (it is
>> currently LGPL)
>>
>>
>> On Mon, Oct 12, 2009 at 1:46 PM, Mark Bennett <mbenn...@ideaeng.com> wrote:
>>
>>> Hi folks,
>>>
>>> I've been working to fix the Japanese SEN morphological analyzer, which
>>> is currently hosted at:
>>> https://sen.dev.java.net
>>>
>>> To review, Japanese doesn't use whitespace for word breaks. The
>>> traditional approach to CJK (Chinese, Japanese, Korean) is to use bigram
>>> character pairs in the index. While this works to a point, some believe
>>> that using proper word breaks provides better results.
>>>
>>> The "lucene-ja" glue layer between Lucene and the core SEN library broke
>>> in May of '09 when a fix was made in Lucene:
>>> http://issues.apache.org/jira/browse/LUCENE-1636
>>>
>>> Uwe S. had a very good insight for a quick fix, and I have been cleaning
>>> up some other issues with the code. I have also spoken with the author,
>>> Takashi Okamoto, and he is fine to have this moved from java.net to ASF;
>>> I think it will be easier for folks to find and use it if it's in ASF.
>>>
>>> I'm not quite ready to submit a patch, but the Wiki suggests emailing the
>>> list with the idea in advance. There are some packaging questions I'll
>>> have, there's actually quite a few parts. Also, the wiki didn't quite spell
>>> out the process to get things into contrib, beyond emailing and submitting a
>>> patch. I also plan to eventually submit a Solr-specific wrapper to the solr
>>> dev list, to allow for dynamic config changes to be made from Solr's
>>> schema. But since the original code was Lucene based, and it provides the
>>> broadest reach, I think having it in core Lucene would be a good start.
>>>
>>> Any comments, suggestions, or mentor volunteers? :-)
>>>
>>> Mark
>>>
>>> --
>>> Mark Bennett / New Idea Engineering, Inc. / mbenn...@ideaeng.com
>>> Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
>>>
>>
>>
>>
>> --
>> Robert Muir
>> rcm...@gmail.com
>>
>
>

--
Robert Muir
rcm...@gmail.com