[ 
https://issues.apache.org/jira/browse/LUCENE-4956?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13798463#comment-13798463
 ] 

Uwe Schindler commented on LUCENE-4956:
---------------------------------------

Hi [~soomyung],

thanks for the clarification. It was not [~jkrupan] who mentioned the GPL 
violation, it was Robert and me. I am glad that you are aware of this and are 
trying to clarify it. Indeed, the license of this file is hard to determine, 
because the Gnutella one (which is the original) has no license header. But the 
whole Gnutella project is GPL-licensed. Those people also started to donate 
this code to Google Guava and wanted to relicense it under the Apache License 
2.0, but this has not been done yet. So we cannot use this code. The missing 
license header may be the reason why the Black Duck scan is happy.

[~cm] offered to donate a PatriciaTrie he wrote himself. Maybe we can replace 
the Gnutella one with it. But I would prefer a solution that does not use a 
trie at all: instead we should use Lucene's FST feature and bundle the whole 
dictionary as a serialized FST (like Kuromoji does).

About the other copy-pasted code: I already removed all the commons-io and 
commons-lang stuff. Commons-io was completely unneeded, because the resource 
handling to load resources from JAR files was not very good and can be done 
much more easily with a simple Class#getResourceAsStream. I already implemented 
that and moved some classes around, so be sure to update your svn checkout 
before working more on the module.
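A minimal sketch of that pattern (the resource name, comment syntax and helper 
class here are made up for illustration, not the module's actual layout):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public final class DictionaryLoader {

  // Loads a dictionary that lives on the classpath next to this class,
  // whether the class comes from a directory or from a JAR file.
  public static List<String> load(String resourceName) {
    InputStream in = DictionaryLoader.class.getResourceAsStream(resourceName);
    if (in == null) {
      throw new IllegalArgumentException("Resource not found: " + resourceName);
    }
    return readLines(in);
  }

  // Separated out so the parsing can be exercised without a real classpath resource.
  static List<String> readLines(InputStream in) {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader =
        new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        // skip blank lines and "!" comment lines (hypothetical dictionary syntax)
        if (!line.isEmpty() && !line.startsWith("!")) {
          lines.add(line);
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return lines;
  }
}
```

Class#getResourceAsStream resolves the name relative to the class's own 
package, so the dictionary files can simply sit beside the analyzer classes.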

I also removed the \u-escaping from the mapHanja.dic file, so I was able to 
remove the StringEscapeUtil class, which did too much unescaping (not only \u, 
but also \n, \t, ...)! But we should really check the license of this file or 
create a new one from the Unicode tables. I left the file in SVN (converted to 
plain UTF-8) for now.

I am currently working on rewriting some code that constantly creates too many 
small objects like Strings, because this slows down indexing! E.g., HanjaUtils 
should not allocate a String just to look up a single char in a map. There are 
better data structures to hold the mapHanja table.
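For example (the class and method names below are hypothetical, not the actual 
HanjaUtils code), a char-keyed structure avoids allocating a one-char String on 
every lookup:

```java
import java.util.HashMap;
import java.util.Map;

public final class HanjaDict {

  // Maps a single Hanja char to its Hangul reading(s). Keying on Character
  // instead of String avoids building a one-char String per lookup during
  // indexing.
  private final Map<Character, char[]> mappings = new HashMap<>();

  void put(char hanja, char[] hangul) {
    mappings.put(hanja, hangul);
  }

  // Lookup by primitive char: no String is created on this hot path.
  // (Autoboxing to Character still happens; a char-indexed array or an
  // open-addressed char-keyed map would remove even that allocation.)
  public char[] lookup(char hanja) {
    return mappings.get(hanja);
  }
}
```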

We should also not use readLines() to load whole dictionaries onto the heap, 
only to iterate over them afterwards and convert them to something else. 
Instead we should use a BufferedReader, read line by line, and do the 
processing directly.
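A sketch of that streaming pattern (the "key=value" line format and class name 
are invented for illustration; the real dictionary syntax may differ):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.HashMap;
import java.util.Map;

public final class StreamingDictionaryParser {

  // Reads "key=value" lines one at a time and fills the target map directly,
  // instead of first materializing a List<String> of the whole file in heap
  // and iterating over it afterwards.
  public static Map<String, String> parse(Reader source) throws IOException {
    Map<String, String> dict = new HashMap<>();
    try (BufferedReader reader = new BufferedReader(source)) {
      String line;
      while ((line = reader.readLine()) != null) {
        int sep = line.indexOf('=');
        if (sep > 0) { // silently skip malformed lines in this sketch
          dict.put(line.substring(0, sep), line.substring(sep + 1));
        }
      }
    }
    return dict;
  }
}
```

Only one line is ever held in memory at a time, so the peak heap usage no 
longer depends on the size of the dictionary file.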

> the korean analyzer that has a korean morphological analyzer and dictionaries
> -----------------------------------------------------------------------------
>
>                 Key: LUCENE-4956
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4956
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>    Affects Versions: 4.2
>            Reporter: SooMyung Lee
>            Assignee: Christian Moen
>              Labels: newbie
>         Attachments: eval.patch, kr.analyzer.4x.tar, lucene-4956.patch, 
> lucene4956.patch, LUCENE-4956.patch
>
>
> Korean has specific characteristics. When developing a search service 
> with Lucene & Solr in Korean, there are some problems in searching and 
> indexing. The Korean analyzer solves these problems with a Korean 
> morphological analyzer. It consists of a Korean morphological analyzer, 
> dictionaries, a Korean tokenizer and a Korean filter. The Korean analyzer 
> is made for Lucene and Solr. If you develop a search service with Lucene 
> in Korean, it is the best idea to choose the Korean analyzer.



--
This message was sent by Atlassian JIRA
(v6.1#6144)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
