[jira] Commented: (NUTCH-36) Chinese in Nutch

Doug Cutting (JIRA) Mon, 11 Apr 2005 15:47:00 -0700

     [ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62604 ]

Doug Cutting commented on NUTCH-36:
-----------------------------------


I like what this patch does, but not how it does it.  Nutch should perform 
bi-gram segmentation of CJK character sequences.  This patch performs such 
segmentation in two places: in the character stream that is the input to the 
tokenizer, and in a filter that processes the output of the tokenizer.  I'm 
unclear why the latter is required.  The former should suffice, no?

But rather than segmenting in the character stream, the segmentation should be 
done in the tokenizer itself.  I think this could be done with something like 
the following in NutchAnalysis.jj:

| <SIGRAM: <CJK> >

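{
  // prevCJK is assumed to be a String holding the previous CJK character,
  // or null when the previous token was not CJK.
  if (prevCJK != null) {
    // Pair this character with the one before it to form a bi-gram.
    matchedToken.image = prevCJK + matchedToken.image;
  } else {
    // First character of a CJK sequence: mark it with a leading "_".
    matchedToken.image = "_" + matchedToken.image;
  }
}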

A little more would be required to maintain prevCJK.
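
To make the bookkeeping concrete, here is a minimal standalone sketch in plain Java, not the actual NutchAnalysis.jj grammar.  The class name CjkBigramSketch, the isCjk() helper and its choice of Unicode blocks, and the tokenize() method are all assumptions made for illustration.  The idea is to set prevCJK after every CJK match and to clear it whenever any other character is seen, so bi-grams never span a gap.

```java
import java.util.ArrayList;
import java.util.List;

class CjkBigramSketch {

    // Hypothetical CJK test; the real <CJK> token definition may cover other ranges.
    static boolean isCjk(char c) {
        Character.UnicodeBlock b = Character.UnicodeBlock.of(c);
        return b == Character.UnicodeBlock.CJK_UNIFIED_IDEOGRAPHS
            || b == Character.UnicodeBlock.HIRAGANA
            || b == Character.UnicodeBlock.KATAKANA
            || b == Character.UnicodeBlock.HANGUL_SYLLABLES;
    }

    // Emits one token per CJK character (prefixed by the previous CJK character,
    // or "_" at the start of a sequence) and one token per run of other
    // letters and digits.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        String prevCJK = null;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (isCjk(c)) {
                flush(word, tokens);
                String cur = String.valueOf(c);
                tokens.add(prevCJK != null ? prevCJK + cur : "_" + cur);
                prevCJK = cur;          // remember for the next CJK match
            } else {
                prevCJK = null;         // any non-CJK character ends the sequence
                if (Character.isLetterOrDigit(c)) {
                    word.append(c);
                } else {
                    flush(word, tokens);
                }
            }
        }
        flush(word, tokens);
        return tokens;
    }

    // Adds the pending non-CJK word, if any, to the token list.
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString());
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Two CJK characters (U+4E2D, U+6587) followed by a Latin word.
        System.out.println(tokenize("\u4e2d\u6587 abc"));
    }
}
```

In the grammar itself the same effect would presumably need prevCJK to be reset in the actions of the other token kinds; that part is not shown here.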

Thoughts?

> Chinese in Nutch
> ----------------
>
>          Key: NUTCH-36
>          URL: http://issues.apache.org/jira/browse/NUTCH-36
>      Project: Nutch
>         Type: Improvement
>   Components: indexer, searcher
>  Environment: all
>     Reporter: Jack Tang
>     Priority: Minor
>  Attachments: 桌
>
> Nutch currently supports Chinese in a very simple way: NutchAnalysis segments 
> CJK terms word by word. 
> So if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 
> 'Bar'), the results in the web GUI highlight 'FooBar' as well as 'Foo' and 
> 'Bar' individually, whereas we expect Nutch to highlight only 'FooBar'.

