[ https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733544#action_12733544 ]
Robert Muir commented on LUCENE-1728:
-------------------------------------

Simon, I agree with you, there is a ton of work to be done. I also did not particularly like my method of moving everything into one package to hide the internals... and I 100% agree that a "correct" refactoring is quite a bit of work.

I don't want to sound like a complainer since I don't have a patch to fix these things, but I want to list some things that I would like to fix/refactor also.

* removal of the GB2312 dictionary dependency: this limits functionality to Simplified Chinese.
* use of Unicode categories (java Character class, etc) versus Utility.getCharType()
* support for codepoints outside of the BMP; this is necessary to support Traditional Chinese.
* a little more flexibility with tokenization. Honestly, I'm really not sold on indexing "words" for Chinese in the first place. But words + bigrams (overlapping tokens), that would be nice.

In the future it would be nice to add support for Traditional Chinese, and there is frequency data out there (libtabe: BSD license, etc), but we need to refactor first.

As far as what to do for 2.9... I really don't know either, just let me know if you need a new patch :)

> Move SmartChineseAnalyzer & resources to own contrib project
> ------------------------------------------------------------
>
>                 Key: LUCENE-1728
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1728
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/analyzers
>            Reporter: Simon Willnauer
>            Assignee: Simon Willnauer
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt
>
>
> SmartChineseAnalyzer depends on a large dictionary that causes the analyzer
> jar to grow up to 3MB. The dictionary is quite big compared to all the other
> resources / class files contained in that jar.
> Having a separate analyzer-cn contrib project enables footprint-sensitive
> users (e.g. using lucene on a mobile phone) to include analyzer.jar without
> getting into trouble with disk space.
> Moving SmartChineseAnalyzer to a separate project could also include a small
> refactoring as Robert mentioned in
> [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722]: several
> classes should be package protected, members and classes could be final,
> commented syserr and logging code should be removed etc.
> I set this issue target to 2.9 - if we cannot make it until then feel free
> to move it to 3.0
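
For illustration only, here is a rough sketch of two items from Robert's list: walking text by code point (so characters outside the BMP are not split into surrogate halves) and emitting overlapping bigrams. The class name CodePointBigrams and its method are hypothetical and are not part of Lucene or SmartChineseAnalyzer. Character.isIdeographic is used as a stand-in for a Unicode-property check and assumes Java 7 or later, which is newer than what Lucene 2.9 targeted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not Lucene code. */
final class CodePointBigrams {

    /**
     * Returns overlapping bigrams for each run of consecutive ideographic
     * code points. A run of length one produces no bigram.
     */
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        int prev = -1; // previous ideographic code point, or -1 if none
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);  // reads a surrogate pair as one code point
            i += Character.charCount(cp);  // advance by 1 or 2 chars
            if (Character.isIdeographic(cp)) {
                // Unicode property check instead of a GB2312 table lookup
                if (prev != -1) {
                    out.add(new StringBuilder()
                            .appendCodePoint(prev)
                            .appendCodePoint(cp)
                            .toString());
                }
                prev = cp;
            } else {
                prev = -1; // a non-ideograph ends the current run
            }
        }
        return out;
    }
}
```

Calling CodePointBigrams.bigrams on a string that contains a CJK Extension B character followed by a BMP ideograph yields a single bigram holding both code points, where a char-by-char loop would have split the surrogate pair. Emitting dictionary words alongside these bigrams, as the comment suggests, is left out of the sketch.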