[Nutch-dev] Re: [jira] Commented: (NUTCH-36) Chinese in Nutch

Jack Tang Mon, 11 Apr 2005 19:06:20 -0700

Cutting

I agree with you!
All segmentation of the character stream should be done in NutchAnalysis.jj.


Also, there is something wrong in my solution. I am so sorry about my
"impulsive" patch. I found the problem some days ago, and I am working
on it.
In my project I just replace my CJKAnalyzer with ContentAnalyzer in
NutchDocumentAnalyzer.

Here is what I found:
Take the CJK character sequence "C1C2C3C4" (where "C1" means one CJK
character). After bi-gram segmentation, the result should be
"C1C2"(0,2), "C2C3"(1,3), "C3C4"(2,4). [NOTE: the first number in the
brackets is the token's start offset and the second is its end offset.]
In other words, the offsets of the bi-gram segmented terms should
overlap when they are returned as new Tokens. The known bug in my
solution is that the positions of the tokens are totally wrong, like
"C1C2"(0,2), "C2C3"(3,5), "C3C4"(6,8). So it crashes when the search
summary is shown.
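
To make the expected offsets concrete, here is a minimal sketch in plain
Java. It is not the patch itself and does not use the Lucene Token class:
BigramOffsets, Bigram and segment are hypothetical names made up for this
illustration, and it assumes every CJK character is a single Java char
(no surrogate pairs).

```java
import java.util.ArrayList;
import java.util.List;

public class BigramOffsets {

    /** Hypothetical stand-in for a token: term text plus start/end offsets. */
    public static final class Bigram {
        public final String text;
        public final int start;
        public final int end;

        public Bigram(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return "\"" + text + "\"(" + start + "," + end + ")";
        }
    }

    /**
     * Splits one run of CJK characters into overlapping bi-grams.
     * "base" is the offset of the run's first character in the original
     * text, so every offset points back into that text.
     */
    public static List<Bigram> segment(String run, int base) {
        List<Bigram> out = new ArrayList<>();
        if (run.length() == 1) {
            // Assumption: a lone character is emitted as a single-character token.
            out.add(new Bigram(run, base, base + 1));
            return out;
        }
        for (int i = 0; i + 2 <= run.length(); i++) {
            // Each bi-gram starts one character after the previous one,
            // so consecutive tokens overlap by exactly one character.
            out.add(new Bigram(run.substring(i, i + 2), base + i, base + i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        // Four CJK characters, written as escapes to avoid encoding problems.
        for (Bigram b : segment("\u4e2d\u6587\u5206\u8bcd", 0)) {
            System.out.println(b);
        }
    }
}
```

The start offset advances by one character per token, not by the token
length. Advancing by the token length plus one is what produces the wrong
(0,2), (3,5), (6,8) sequence, whose offsets run past the end of the
original text.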

/Jack

On Apr 12, 2005 6:20 AM, Doug Cutting (JIRA) <[EMAIL PROTECTED]> wrote:
>     [ 
> http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62604 ]
> 
> Doug Cutting commented on NUTCH-36:
> -----------------------------------
> 
> I like what this patch does, but not how it does it.  Nutch should perform 
> bi-gram segmentation of CJK character sequences.  This patch performs such 
> segmentation at two places: in the character stream that is the input to the 
> tokenizer, and in a filter that processes the output of the tokenizer.  I'm 
> unclear why the latter is required.  The former should suffice, no?
> 
> But instead of segmenting in the character stream it should be done in the 
> tokenizer itself.  I think this could be done with something like the 
> following in NutchAnalysis.jj.
> 
> | <SIGRAM: <CJK> >
> 
> { if (prevCJK) {
>    matchedToken.image = prevCJK + matchedToken.image;
>  } else {
>    matchedToken.image = "_" + matchedToken.image;
>  }
> }
> 
> A little more would be required to maintain prevCJK.
> 
> Thoughts?
> 
> > Chinese in Nutch
> > ----------------
> >
> >          Key: NUTCH-36
> >          URL: http://issues.apache.org/jira/browse/NUTCH-36
> >      Project: Nutch
> >         Type: Improvement
> >   Components: indexer, searcher
> >  Environment: all
> >     Reporter: Jack Tang
> >     Priority: Minor
> >  Attachments: 桌
> >
> > Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK 
> > terms word-by-word.
> > So, if I search for the Chinese term 'FooBar' (two Chinese words: 'Foo' and 
> > 'Bar'), the results in the web GUI will highlight 'FooBar' as well as 'Foo' 
> > and 'Bar', whereas we expect Nutch to highlight only 'FooBar'.
> 

