[jira] [Commented] (LUCENE-3305) Kuromoji code donation - a new Japanese morphological analyzer

Robert Muir (Commented) (JIRA) Fri, 11 Nov 2011 17:57:19 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-3305?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148933#comment-13148933
 ]


Robert Muir commented on LUCENE-3305:
-------------------------------------

Looks like we want to add the Lucene analyzer/tokenizer and Solr factories from 
kuromoji-solr-0.5.3-asf.tar.gz.

I'd say once we get things going, we just download the dictionary, build it, 
and commit the built dictionary under the resources/ folder (this is where the 
script puts it).

I think for this kind of feature it might be hard to iterate with patches, so 
we should maybe try to get it into SVN (trunk) initially and iterate with 
smaller issues. The code looks pretty clean to me already.

The produced jar file is somewhat large, but I think it's still reasonable, so 
I think we should look past this for now. Having worked with Sen before, I 
know some ways we can shrink this a lot, but that would be best handled in a 
future issue.

Some Java 6 APIs are used here (e.g. Unicode normalization). Christian, can 
you confirm this is only for the dictionary-build stage? It looked to me like 
it's only needed for ipadic/unidic parsing, but not for custom dictionary 
support.

If it's only for the build stage, personally I think that's fine for 3.x too, 
because I'm suggesting we commit a 'built' dictionary and tell people that if 
they want to compile the dictionary themselves, they need Java 6. We could put 
the dictionary-building code under a tools/ directory that's Java 6-only, or we 
could depend on ICU for just the tools/ piece (I think we already have such 
hacks for generating the JFlex rules for StandardTokenizer) and be fine on 
Java 5.
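
For reference, the Java 6 normalization API in question is presumably 
java.text.Normalizer, which was added to the JDK in Java 6. A minimal sketch 
of what a dictionary-build step might do with it follows. The choice of NFKC 
and the NormalizeDemo class are assumptions for illustration, not taken from 
the Kuromoji sources.

```java
import java.text.Normalizer;

// Hypothetical demo class; not part of Kuromoji.
public class NormalizeDemo {
    // Applies NFKC normalization (assumed form) using the Java 6+ JDK API.
    static String nfkc(String s) {
        return Normalizer.normalize(s, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // Half-width katakana KA + half-width voiced sound mark
        // composes to full-width GA (U+30AC).
        System.out.println(nfkc("\uFF76\uFF9E"));
        // Full-width Latin letters fold to ASCII "ABC".
        System.out.println(nfkc("\uFF21\uFF22\uFF23"));
    }
}
```

If normalization like this happens only while building the dictionary, the 
runtime analyzer would not need the Java 6 class at all.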

+1 for the GraphVizFormatter... 

                
> Kuromoji code donation - a new Japanese morphological analyzer
> --------------------------------------------------------------
>
>                 Key: LUCENE-3305
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3305
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Christian Moen
>            Assignee: Simon Willnauer
>         Attachments: Kuromoji short overview .pdf, LUCENE-3305.patch, 
> ip-clearance-Kuromoji.xml, ip-clearance-Kuromoji.xml, 
> kuromoji-0.7.6-asf.tar.gz, kuromoji-0.7.6.tar.gz, 
> kuromoji-solr-0.5.3-asf.tar.gz, kuromoji-solr-0.5.3.tar.gz
>
>
> Atilika Inc. (アティリカ株式会社) would like to donate the Kuromoji Japanese 
> morphological analyzer to the Apache Software Foundation in the hope that it 
> will be useful to Lucene and Solr users in Japan and elsewhere.
> The project was started in 2010 since we couldn't find any high-quality, 
> actively maintained and easy-to-use Java-based Japanese morphological 
> analyzers, and these became many of our design goals for Kuromoji.
> Kuromoji also has a segmentation mode that is particularly useful for search, 
> which we hope will interest Lucene and Solr users.  Compound-nouns, such as 
> 関西国際空港 (Kansai International Airport) and 日本経済新聞 (Nikkei Newspaper), are 
> segmented as one token with most analyzers.  As a result, a search for 空港 
> (airport) or 新聞 (newspaper) will not give you a hit for these words.  
> Kuromoji can segment these words into 関西 国際 空港 and 日本 経済 新聞, which is 
> generally what you would want for search, and you'll get a hit.
> We also wanted to make sure the technology has a license that makes it 
> compatible with other Apache Software Foundation software to maximize its 
> usefulness.  Kuromoji has an Apache License 2.0 and all code is currently 
> owned by Atilika Inc.  The software has been developed by my good friend and 
> ex-colleague Masaru Hasegawa and myself.
> Kuromoji uses the so-called IPADIC for its dictionary/statistical model and 
> its license terms are described in NOTICE.txt.
> I'll upload code distributions and their corresponding hashes and I'd very 
> much like to start the code grant process.  I'm also happy to provide patches 
> to integrate Kuromoji into the codebase, if you prefer that.
> Please advise on how you'd like me to proceed with this.  Thank you.
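
The effect of the search-oriented segmentation described in the issue can be 
sketched with a toy inverted index. This is only an illustration under stated 
assumptions: SegmentationDemo and its methods are hypothetical, the token 
lists are typed in by hand from the examples in the description, and no 
Kuromoji code is called.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical demo class; tokens are hard-coded, not produced by Kuromoji.
public class SegmentationDemo {
    // Each compound noun kept as a single token (one document per compound).
    static List<List<String>> wholeCompounds() {
        return Arrays.asList(
            Arrays.asList("関西国際空港"),
            Arrays.asList("日本経済新聞"));
    }

    // The same compounds split into their parts, as in the description.
    static List<List<String>> segmented() {
        return Arrays.asList(
            Arrays.asList("関西", "国際", "空港"),
            Arrays.asList("日本", "経済", "新聞"));
    }

    // Builds a toy inverted index: token -> ids of documents containing it.
    static Map<String, Set<Integer>> index(List<List<String>> docs) {
        Map<String, Set<Integer>> idx = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String token : docs.get(id)) {
                idx.computeIfAbsent(token, k -> new TreeSet<>()).add(id);
            }
        }
        return idx;
    }

    // Exact-match term lookup; returns an empty set when nothing matches.
    static Set<Integer> search(Map<String, Set<Integer>> idx, String term) {
        return idx.getOrDefault(term, Collections.emptySet());
    }

    public static void main(String[] args) {
        System.out.println(search(index(wholeCompounds()), "空港")); // prints []
        System.out.println(search(index(segmented()), "空港"));      // prints [0]
    }
}
```

With whole compounds indexed as single tokens, the query 空港 matches nothing. 
With the segmented tokens, it matches document 0.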

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


