[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153 ]
Jack Tang commented on NUTCH-36:
--------------------------------
Follow the steps below to make Nutch support Chinese well.
1. Modify NutchAnalysis.jj.
===========================================
@@ -106,7 +106,7 @@
}
// chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
===========================================
Why change "<SIGRAM: <CJK>>" to "<SIGRAM: (<CJK>)+>"? Because Chinese term
segmentation (I don't know Japanese and Korean well) is totally different from
English. In other words, word-by-word (single-character) segmentation is
inefficient for indexing and searching Chinese text.
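The difference between the two grammar rules can be sketched in plain Java. This is an illustrative stand-alone demo, not Nutch code; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Unigram (word-by-word) vs. bi-gram segmentation of a CJK run.
// With <SIGRAM: <CJK>>, every character is its own token (unigrams);
// with <SIGRAM: (<CJK>)+>, a whole run matches and can be bi-segmented.
public class BigramDemo {
    // Word-by-word: each character becomes a separate token.
    static List<String> unigrams(String cjk) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < cjk.length(); i++) {
            out.add(cjk.substring(i, i + 1));
        }
        return out;
    }

    // Bi-gram: every pair of adjacent characters becomes one token, so a
    // two-character word is indexed (and highlighted) as a single unit.
    static List<String> bigrams(String cjk) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < cjk.length(); i++) {
            out.add(cjk.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "中文分词";
        System.out.println(unigrams(run)); // [中, 文, 分, 词]
        System.out.println(bigrams(run));  // [中文, 文分, 分词]
    }
}
```

A two-character word thus survives as one searchable token instead of being split into unrelated single characters.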
2. Modify FastCharStream.java
===========================================
@@ -18,6 +18,8 @@
import java.io.*;
+import org.apache.lucene.analysis.Token;
+
/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
* this does not do line-number counting, but instead keeps track of the
* character position of the token in the input, as required by Lucene's
[EMAIL PROTECTED]
@@ -69,10 +71,15 @@
if (charsRead == -1)
throw new IOException("read past eof");
else
- bufferLength += charsRead;
+ {
+ charsRead = new CJKCharStream().readChineseChars(newPosition, charsRead);
+ bufferLength += charsRead;
+ }
}
- public final char BeginToken() throws IOException {
+
+
+public final char BeginToken() throws IOException {
tokenStart = bufferPosition;
return readChar();
}
@@ -117,4 +124,45 @@
public final int getBeginLine() {
return 1;
}
+
+
+ final class CJKCharStream
+ {
+
+ /**
+ * @param newPosition
+ * @param charsRead
+ * @return
+ * @throws IOException
+ */
+ int readChineseChars(int newPosition, int charsRead)
+ throws IOException
+ {
+ String str = new String(buffer,newPosition,charsRead);
+ CJKTokenizer tokenizer = new CJKTokenizer(new StringReader(str));
+ Token token = tokenizer.next();
+ StringBuffer sb = new StringBuffer();
+ while(token != null)
+ {
+ sb.append(token.termText()).append(" ");
+ token = tokenizer.next();
+ }
+
+
+
+ while(sb.length()>buffer.length-newPosition)
+ {
+ char[] newBuffer = new char[buffer.length*2];
+ System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+ buffer = newBuffer;
+ }
+
+ for(int i=0;i<sb.length();i++){
+ buffer[newPosition+i]=sb.charAt(i);
+ }
+
+ return sb.length();
+ }
+ }
+
}
To support "<SIGRAM: (<CJK>)+>" in NutchAnalysis.jj, we do Chinese term
segmentation in FastCharStream, which runs before NutchAnalysis's parse
method. The main component is CJKTokenizer, which bi-segments Chinese terms.
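The core of readChineseChars above can be sketched without Lucene: segment the freshly read characters and write the space-separated tokens back into the buffer, doubling the buffer when the expanded text no longer fits. In this sketch, segment() is a hypothetical stand-in for CJKTokenizer, and the class name is made up:

```java
// Lucene-free sketch of the readChineseChars idea from the patch above.
public class CJKRewriteDemo {
    char[] buffer = new char[8];

    // Stand-in for CJKTokenizer: emit adjacent-character bi-grams,
    // each followed by a space, so JavaCC sees separate tokens.
    static String segment(String cjk) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < cjk.length(); i++) {
            sb.append(cjk, i, i + 2).append(' ');
        }
        return sb.toString();
    }

    // Mirrors readChineseChars: rewrites the chars at newPosition and
    // returns the new count of valid chars there.
    int rewrite(int newPosition, int charsRead) {
        String expanded = segment(new String(buffer, newPosition, charsRead));
        // Double the buffer until the expanded text fits.
        while (expanded.length() > buffer.length - newPosition) {
            char[] bigger = new char[buffer.length * 2];
            System.arraycopy(buffer, 0, bigger, 0, buffer.length);
            buffer = bigger;
        }
        expanded.getChars(0, expanded.length(), buffer, newPosition);
        return expanded.length();
    }

    public static void main(String[] args) {
        CJKRewriteDemo demo = new CJKRewriteDemo();
        "中文分词".getChars(0, 4, demo.buffer, 0);
        int n = demo.rewrite(0, 4);
        System.out.println(new String(demo.buffer, 0, n)); // 中文 文分 分词
    }
}
```

Note the expansion: four input characters become nine (three bi-grams plus separators), which is why the buffer may need to grow.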
3. Add CJKTokenizer.java
4. Modify NutchDocumentTokenizer.java
===========================================
@@ -46,8 +46,11 @@
while (true) {
t = tokenManager.getNextToken();
switch (t.kind) { // skip query syntax tokens
- case EOF: case WORD: case ACRONYM: case SIGRAM:
+ case EOF: case WORD: case ACRONYM:
break loop;
+ case SIGRAM:
+ CJKTokenizer cjkT = new CJKTokenizer(input);
+ return cjkT.next();
default:
}
}
===========================================
NutchDocumentTokenizer.tokenStream() is called by NutchDocumentAnalyzer, and
in this way the modified NutchDocumentTokenizer class lets
NutchDocumentAnalyzer support Chinese.
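The delegation in step 4 can be sketched as follows. This is not the Nutch or Lucene API; the token-kind constants and the bi-gram stand-in for CJKTokenizer are invented for illustration:

```java
// Sketch of the patched switch: SIGRAM (CJK) tokens are handed to a
// CJK-aware tokenizer, while other token kinds pass through unchanged.
public class DelegationDemo {
    static final int WORD = 0, SIGRAM = 1;

    // Stand-in for CJKTokenizer: split a CJK run into overlapping bi-grams.
    static String[] cjkTokens(String run) {
        String[] out = new String[Math.max(run.length() - 1, 0)];
        for (int i = 0; i + 1 < run.length(); i++) {
            out[i] = run.substring(i, i + 2);
        }
        return out;
    }

    static String[] tokensFor(int kind, String text) {
        if (kind == SIGRAM) {
            return cjkTokens(text);       // delegate to the CJK tokenizer
        }
        return new String[] { text };     // WORD, ACRONYM, ... unchanged
    }

    public static void main(String[] args) {
        for (String t : tokensFor(SIGRAM, "中文分词")) {
            System.out.println(t); // 中文, 文分, 分词
        }
        for (String t : tokensFor(WORD, "nutch")) {
            System.out.println(t); // nutch
        }
    }
}
```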
> Chinese in Nutch
> ----------------
>
> Key: NUTCH-36
> URL: http://issues.apache.org/jira/browse/NUTCH-36
> Project: Nutch
> Type: Improvement
> Components: indexer, searcher
> Environment: all
> Reporter: Jack Tang
> Priority: Minor
>
> Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK
> terms word-by-word.
> So, if I search for the Chinese term 'FooBar' (two Chinese characters, 'Foo'
> and 'Bar'), the web GUI highlights 'FooBar' as well as 'Foo' and 'Bar',
> while we expect Nutch to highlight only 'FooBar'.