[ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13063928#comment-13063928
 ] 

Christian Moen commented on LUCENE-3305:
----------------------------------------

Thanks, Uwe!

I think we definitely should work together and combine the great work that 
Robert, Koji & co. have been doing on Lucene-GoSen with Kuromoji to make a 
highly attractive Japanese linguistics offering that is also an integrated part 
of Lucene/Solr.

The attributes do indeed look very nice -- excellent job!  I have several 
improvements in mind for Kuromoji (and other Japanese related code) and I'm 
looking forward to working with you to improve some of these things.

In addition to its license, an issue with GoSen (and Sen) used to be its 
segmentation quality.  To my knowledge, these analyzers still don't support 
so-called "unknown words", which means that words that are not in the 
dictionaries are handled poorly, and that hurts segmentation quality.





> Kuromoji code donation - a new Japanese morphological analyzer
> --------------------------------------------------------------
>
>                 Key: LUCENE-3305
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3305
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Christian Moen
>         Attachments: Kuromoji short overview .pdf, kuromoji-0.7.6-asf.tar.gz, 
> kuromoji-0.7.6.tar.gz, kuromoji-solr-0.5.3-asf.tar.gz, 
> kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 because we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token by most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a hit for these words.  Kuromoji 
> can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is generally what 
> you want for search, so you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji is licensed under the Apache License 2.0 and all code 
> is currently owned by Atilika Inc.  The software has been developed by my 
> good friend and ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
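
The effect of the search-oriented segmentation described in the issue text 
above can be illustrated with a small sketch.  This is a toy only: the lookup 
table, the tokenize() and matches() functions and the search_mode flag are all 
made up for illustration, and this is not Kuromoji's API or algorithm 
(Kuromoji uses the IPADIC dictionary/statistical model, not a hand-written 
table).

```python
# Toy lexicon (hypothetical): each compound noun maps to its component words.
# Kuromoji derives segmentation from IPADIC, not from a table like this one.
COMPOUNDS = {
    "関西国際空港": ["関西", "国際", "空港"],
    "日本経済新聞": ["日本", "経済", "新聞"],
}


def tokenize(text, search_mode=False):
    """Greedy longest-match over the toy lexicon.

    With search_mode=False a known compound is emitted as one token.
    With search_mode=True it is emitted as its component words.
    Characters not covered by the lexicon become single-character tokens.
    """
    by_length = sorted(COMPOUNDS, key=len, reverse=True)
    tokens = []
    i = 0
    while i < len(text):
        match = next((w for w in by_length if text.startswith(w, i)), None)
        if match is None:
            tokens.append(text[i])
            i += 1
        elif search_mode:
            tokens.extend(COMPOUNDS[match])
            i += len(match)
        else:
            tokens.append(match)
            i += len(match)
    return tokens


def matches(query, document, search_mode):
    """A query term hits only if it equals one of the indexed tokens."""
    return query in tokenize(document, search_mode)


if __name__ == "__main__":
    print(tokenize("関西国際空港"))
    print(tokenize("関西国際空港", search_mode=True))
    print(matches("空港", "関西国際空港", search_mode=False))
    print(matches("空港", "関西国際空港", search_mode=True))
```

With the compound kept as a single token, the query 空港 finds nothing because 
no indexed token equals it; once the compound is split into its parts, the 
same query gets a hit.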

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



