[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62604 ]
Doug Cutting commented on NUTCH-36:
-----------------------------------
I like what this patch does, but not how it does it. Nutch should perform
bi-gram segmentation of CJK character sequences. This patch performs such
segmentation in two places: in the character stream that is the input to the
tokenizer, and in a filter that processes the output of the tokenizer. I'm
unclear why the latter is required. The former should suffice, no?
But rather than segmenting in the character stream, the segmentation should be
done in the tokenizer itself. I think this could be done with something like
the following in NutchAnalysis.jj.
| <SIGRAM: <CJK> >
  { // prevCJK: image of the preceding CJK character, or null if there is none
    if (prevCJK != null) {
      matchedToken.image = prevCJK + matchedToken.image;
    } else {
      matchedToken.image = "_" + matchedToken.image;
    }
  }
A little more would be required to maintain prevCJK.
Thoughts?
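To make the intended behaviour concrete, here is a minimal stand-alone Java
sketch of the same idea outside of JavaCC. It is only an illustration, not the
NutchAnalysis.jj change itself: the class name CjkBigramSketch, the methods
isCjk and cjkTokens, and the choice of Unicode blocks treated as CJK are all
assumptions made for this example. It also shows one way prevCJK could be
maintained: set it after every CJK character and reset it on anything else.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names exist in Nutch.
class CjkBigramSketch {

    // Assumption: a rough CJK test covering only a few Unicode blocks.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // Emits one token per CJK character: the character prefixed with the
    // previous CJK character, or with "_" when it starts a CJK sequence.
    // Non-CJK characters produce no token here and reset prevCJK.
    static List<String> cjkTokens(String text) {
        List<String> tokens = new ArrayList<>();
        String prevCJK = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                String image = String.valueOf(c);
                tokens.add((prevCJK != null ? prevCJK : "_") + image);
                prevCJK = image;   // remember for the next CJK character
            } else {
                prevCJK = null;    // the CJK sequence ended
            }
        }
        return tokens;
    }
}
```

For the input "中文字 x 日本", cjkTokens returns [_中, 中文, 文字, _日, 日本]:
each CJK sequence yields one "_"-prefixed token followed by overlapping
bi-grams.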
> Chinese in Nutch
> ----------------
>
> Key: NUTCH-36
> URL: http://issues.apache.org/jira/browse/NUTCH-36
> Project: Nutch
> Type: Improvement
> Components: indexer, searcher
> Environment: all
> Reporter: Jack Tang
> Priority: Minor
> Attachments: 桌
>
> Nutch currently supports Chinese in a very simple way: NutchAnalysis segments
> CJK terms word by word.
> So, if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and
> 'Bar'), the results in the web GUI highlight 'FooBar' as well as 'Foo' and
> 'Bar', whereas we expect Nutch to highlight only 'FooBar'.