[
https://issues.apache.org/jira/browse/LUCENE-1728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12733544#action_12733544
]
Robert Muir commented on LUCENE-1728:
-------------------------------------
Simon, I agree with you, there is a ton of work to be done.
I also did not particularly like my method of moving everything into one
package to hide the internals... and I 100% agree that a "correct" refactoring
is quite a bit of work.
I don't want to sound like a complainer since I don't have a patch to fix these
things, but I want to list some things that I would like to fix/refactor also.
* removal of the GB2312 dictionary dependency: the dependency limits
functionality to Simplified Chinese.
* use of Unicode categories (Java's Character class, etc.) instead of
Utility.getCharType()
* support for code points outside the BMP; this is necessary to support
Traditional Chinese.
* a little more flexibility with tokenization. Honestly, I'm really not sold on
indexing "words" for Chinese in the first place, but words + bigrams
(overlapping tokens) would be nice.
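To make the code point and bigram items above concrete, here is a rough sketch,
not a patch. It walks a string by code point (so characters outside the BMP are
not split into surrogate halves), classifies each one with the JDK's own
Unicode data, and emits overlapping bigrams for each run of Han characters.
Everything in it is an assumption for illustration: the class and method names
are made up, it does not reproduce what Utility.getCharType() does, it covers
only the bigram half of "words + bigrams", and Character.UnicodeScript needs
Java 7 or later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of SmartChineseAnalyzer.
final class CodePointBigrams {

    /** True if the code point is a Han ideograph, per the JDK's Unicode tables (Java 7+). */
    static boolean isHan(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.HAN;
    }

    /**
     * Emits overlapping bigrams for every run of consecutive Han code points.
     * A run of length one is emitted as a single-character token.
     * Non-Han characters only act as run separators and are dropped.
     */
    static List<String> hanBigrams(String text) {
        List<String> tokens = new ArrayList<>();
        List<Integer> run = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);   // reads a full code point, even outside the BMP
            i += Character.charCount(cp);   // advances by 1 or 2 chars accordingly
            if (isHan(cp)) {
                run.add(cp);
            } else {
                flush(run, tokens);
            }
        }
        flush(run, tokens);
        return tokens;
    }

    /** Turns the collected run into tokens and clears it. */
    private static void flush(List<Integer> run, List<String> out) {
        if (run.size() == 1) {
            out.add(new String(Character.toChars(run.get(0))));
        }
        for (int j = 0; j + 1 < run.size(); j++) {
            out.add(new StringBuilder()
                    .appendCodePoint(run.get(j))
                    .appendCodePoint(run.get(j + 1))
                    .toString());
        }
        run.clear();
    }

    public static void main(String[] args) {
        // U+6211 U+7231 U+5317 U+4EAC, then Latin text, then U+20000 U+20001 (outside the BMP)
        String text = "\u6211\u7231\u5317\u4EAC ok \uD840\uDC00\uD840\uDC01";
        for (String token : hanBigrams(text)) {
            System.out.println(token);
        }
    }
}
```

Iterating with codePointAt/charCount rather than charAt is what keeps the two
supplementary characters together as one bigram instead of four broken halves.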
In the future it would be nice to add support for Traditional Chinese, and
there is frequency data out there (libtabe: BSD license, etc.), but we need to
refactor first.
As far as what to do for 2.9... I really don't know either, just let me know if
you need a new patch :)
> Move SmartChineseAnalyzer & resources to own contrib project
> ------------------------------------------------------------
>
> Key: LUCENE-1728
> URL: https://issues.apache.org/jira/browse/LUCENE-1728
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/analyzers
> Reporter: Simon Willnauer
> Assignee: Simon Willnauer
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1728.txt, LUCENE-1728.txt, LUCENE-1728.txt
>
>
> SmartChineseAnalyzer depends on a large dictionary that causes the analyzer
> jar to grow up to 3MB. The dictionary is quite big compared to all the other
> resources / class files contained in that jar.
> Having a separate analyzer-cn contrib project enables footprint-sensitive
> users (e.g. using lucene on a mobile phone) to include analyzer.jar without
> getting into trouble with disk space.
> Moving SmartChineseAnalyzer to a separate project could also include a small
> refactoring. As Robert mentioned in
> [LUCENE-1722|https://issues.apache.org/jira/browse/LUCENE-1722], several
> classes should be package-private, members and classes could be final,
> commented-out syserr and logging code should be removed, etc.
> I set this issue's target to 2.9. If we cannot make it by then, feel free
> to move it to 3.0.