[jira] Commented: (LUCENE-1728) Move SmartChineseAnalyzer & resources to own contrib project

Robert Muir (JIRA) Tue, 21 Jul 2009 02:20:41 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733544#action_12733544
 ]


Robert Muir commented on LUCENE-1728:
-------------------------------------

Simon, I agree with you, there is a ton of work to be done. 

I also did not particularly like my method of moving everything into one 
package to hide the internals... and I 100% agree that a "correct" refactoring 
is quite a bit of work. 

I don't want to sound like a complainer since I don't have a patch to fix these 
things, but I want to list some things that I would like to fix/refactor also.

* removal of GB2312 dictionary dependency: this limits functionality to
Simplified Chinese.
* use of Unicode categories (Java Character class, etc) versus
Utility.getCharType()
* support for code points outside of the BMP; this is necessary to support
Traditional Chinese.
* a little more flexibility with tokenization. Honestly, I'm really not sold on
indexing "words" for Chinese in the first place. But words + bigrams
(overlapping tokens) would be nice.
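
To make the Unicode-category, non-BMP, and bigram points above more concrete,
here is a minimal sketch. It is not a patch and not the existing
SmartChineseAnalyzer code. The class name CodePointBigrams and its methods are
hypothetical. It assumes that java.lang.Character.isLetter(int) is an
acceptable stand-in for the checks done by Utility.getCharType(), and it only
produces the overlapping bigrams, not the dictionary "words".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration only; not part of Lucene. */
class CodePointBigrams {

    /**
     * Emits overlapping bigrams over runs of letters. The text is walked by
     * code point, so a surrogate pair (a character outside the BMP) is
     * treated as one unit instead of two chars.
     */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        List<String> run = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp); // advances 2 chars for a surrogate pair
            if (Character.isLetter(cp)) { // Unicode category lookup, no custom table
                run.add(new String(Character.toChars(cp)));
            } else {
                flush(run, out);          // punctuation, digits, spaces end a run
            }
        }
        flush(run, out);
        return out;
    }

    /** Turns one run of letters into bigrams; a lone letter is kept as-is. */
    private static void flush(List<String> run, List<String> out) {
        if (run.size() == 1) {
            out.add(run.get(0));
        }
        for (int k = 0; k + 1 < run.size(); k++) {
            out.add(run.get(k) + run.get(k + 1));
        }
        run.clear();
    }
}
```

Calling CodePointBigrams.bigrams on a four-character Chinese string yields
three overlapping two-character tokens. A real implementation would sit behind
a Lucene Tokenizer and would need offsets and token types, which this sketch
leaves out.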

In the future it would be nice to add support for Traditional Chinese, and
there is frequency data out there (libtabe: BSD license, etc), but we need to
refactor first.

As far as what to do for 2.9... I really don't know either, just let me know if 
you need a new patch :)


> Move SmartChineseAnalyzer & resources to own contrib project
> ------------------------------------------------------------
>
>                 Key: LUCENE-1728
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1728
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt
>
>
> SmartChineseAnalyzer depends on a large dictionary that causes the analyzer
> jar to grow up to 3MB. The dictionary is quite big compared to all the other
> resources / class files contained in that jar.
> Having a separate analyzer-cn contrib project enables footprint-sensitive
> users (e.g. using Lucene on a mobile phone) to include analyzer.jar without
> getting into trouble with disk space.
> Moving SmartChineseAnalyzer to a separate project could also include a small
> refactoring. As Robert mentioned in
> [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722], several
> classes should be package protected, members and classes could be final,
> commented syserr and logging code should be removed, etc.
> I set this issue target to 2.9. If we cannot make it by then, feel free
> to move it to 3.0.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

