[jira] Issue Comment Edited: (NUTCH-224) Nutch doesn't handle Korean text at all

Attila Pados (JIRA) Fri, 26 Mar 2010 12:10:49 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12850306#action_12850306
 ]


Attila Pados edited comment on NUTCH-224 at 3/26/10 7:08 PM:
-------------------------------------------------------------

Hi, 

What is the status of the Korean language problems? There is also 
the patch 666, which should cover Korean language problems, but that is for 
Nutch 1.1, which has not been released yet as far as I know. 

I spent a long time debugging the above problem and found that the 
NutchAnalysisTokenManager.jjCanMove_1 method does not cover the Korean UTF 
value range. As a result, Korean characters are analyzed as whitespace and 
omitted from the search query when the query object is generated. As a crude 
hack, I placed the values for 2 Korean characters into a jjbitVec8 constant 
and added new entries for them in the switch, but I think this will leave out 
most of the Korean characters. 

These are the additions: 
static final long[] jjbitVec8 = {0x0L, 0x0L, 0x0L, 0x2fffffffffffffL}; 
private static final boolean jjCanMove_1(int hiByte, int i1, int i2, long l1, long l2) 
{ 
   switch(hiByte) 
   { 
      case 48: 
         return ((jjbitVec4[i2] & l2) != 0L); 
      case 49: 
         return ((jjbitVec5[i2] & l2) != 0L); 
      case 51: 
         return ((jjbitVec6[i2] & l2) != 0L); 
      case 61: 
         return ((jjbitVec7[i2] & l2) != 0L); 
      case 172: /* <- new entry (hiByte 0xAC) */ 
         return ((jjbitVec8[i2] & l2) != 0L); 
      case 201: /* <- new entry (hiByte 0xC9) */ 
         return ((jjbitVec8[i2] & l2) != 0L); 
      default : 
         if ((jjbitVec3[i1] & l1) != 0L) 
            return true; 
         return false; 
   } 
}

Debugging shows that the query object now contains the correct characters, but 
the page's Korean text is still not indexed. 
This may be because the same problem is also present somewhere in the 
crawler/indexer process. 
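
To make it easier to see why the hack above leaves out most Korean characters, 
here is a small standalone sketch. It assumes that hiByte, i2 and l2 are 
derived from the current character the way JavaCC-generated token managers 
usually do it (high byte of the char, 64-bit word index within the 256-char 
page, bit within that word); I have not checked this against the generated 
Nutch source. The class name HangulBitVecSketch and the isHangulSyllable 
range check are hypothetical and only for illustration, not part of Nutch or 
of any patch. 

```java
public class HangulBitVecSketch {
    // Copied from the comment above: one 256-character page, as four 64-bit words.
    static final long[] jjbitVec8 = {0x0L, 0x0L, 0x0L, 0x2fffffffffffffL};

    // Mirrors only the two new switch cases (hiByte 172 = 0xAC, 201 = 0xC9).
    // Assumption: this is how the generated code splits a char into indexes.
    static boolean hackAccepts(char c) {
        int hiByte = c >> 8;          // which 256-character page
        int i2 = (c & 0xff) >> 6;     // which 64-bit word within the page
        long l2 = 1L << (c & 077);    // which bit within that word
        if (hiByte == 172 || hiByte == 201) {
            return (jjbitVec8[i2] & l2) != 0L;
        }
        return false;
    }

    // Hypothetical alternative: a plain range check over the Hangul Syllables
    // block (U+AC00 ... U+D7AF) quoted in the issue description.
    static boolean isHangulSyllable(char c) {
        return c >= '\uAC00' && c <= '\uD7AF';
    }

    public static void main(String[] args) {
        int inBlock = 0;
        int accepted = 0;
        for (char c = '\uAC00'; c <= '\uD7AF'; c++) {
            inBlock++;
            if (hackAccepts(c)) {
                accepted++;
            }
        }
        System.out.println(accepted + " of " + inBlock
                + " code points in the Hangul Syllables block pass the hack");
    }
}
```

Under these assumptions only part of the last quarter of pages 0xAC and 0xC9 
is accepted, so a fix would need to cover every page from 0xAC to 0xD7, both 
here and in whatever the indexer uses. 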

> Nutch doesn't handle Korean text at all
> ---------------------------------------
>
>                 Key: NUTCH-224
>                 URL: https://issues.apache.org/jira/browse/NUTCH-224
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 0.7.1
>            Reporter: KuroSaka TeruHiko
>
> I was browsing NutchAnalysis.jj and found that
> Hangul Syllables (U+AC00 ... U+D7AF; U+xxxx means
> a Unicode character of the hex value xxxx) are not
> part of the LETTER or CJK class.  It seems to me that
> Nutch cannot handle Korean documents at all.
> I posted the above message to the nutch-user ML and Cheolgoo Kang 
> [app...@gmail.com]
> replied:
> ------------------------------------------------------------------------------------
> There was a similar issue with Lucene's StandardTokenizer.jj.
> http://issues.apache.org/jira/browse/LUCENE-444
> and
> http://issues.apache.org/jira/browse/LUCENE-461
> I have almost no experience with Nutch, but you can handle it like
> those issues above.
> ------------------------------------------------------------------------------------
> Both fixes should probably be ported back to NutchAnalysis.jj.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
