[ http://issues.apache.org/jira/browse/NUTCH-36?page=comments#action_62153 ]
Jack Tang commented on NUTCH-36:
--------------------------------
Follow the steps below to make Nutch support Chinese well.
1. Modify NutchAnalysis.jj.
===========================================
@@ -106,7 +106,7 @@
}
// chinese, japanese and korean characters
-| <SIGRAM: <CJK> >
+| <SIGRAM: (<CJK>)+ >
===========================================
Why change "<SIGRAM: <CJK>>" to "<SIGRAM: (<CJK>)+>"? Because Chinese term
segmentation (I don't know Japanese and Korean well) is totally different from
English. In other words, word-by-word (single-character) segmentation is
inefficient for indexing and searching Chinese text.
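The difference between the two grammar rules can be sketched in plain Java. This is an illustrative stand-alone demo, not Nutch code; the class and method names are made up for the example:

```java
import java.util.ArrayList;
import java.util.List;

// Unigram (word-by-word) vs. bi-gram segmentation of a CJK run.
// With <SIGRAM: <CJK>>, every character is its own token (unigrams);
// with <SIGRAM: (<CJK>)+>, a whole run matches and can be bi-segmented.
public class BigramDemo {
    // Word-by-word: each character becomes a separate token.
    static List<String> unigrams(String cjk) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i < cjk.length(); i++) {
            out.add(cjk.substring(i, i + 1));
        }
        return out;
    }

    // Bi-gram: every pair of adjacent characters becomes one token, so a
    // two-character word is indexed (and highlighted) as a single unit.
    static List<String> bigrams(String cjk) {
        List<String> out = new ArrayList<String>();
        for (int i = 0; i + 1 < cjk.length(); i++) {
            out.add(cjk.substring(i, i + 2));
        }
        return out;
    }

    public static void main(String[] args) {
        String run = "中文分词";
        System.out.println(unigrams(run)); // [中, 文, 分, 词]
        System.out.println(bigrams(run));  // [中文, 文分, 分词]
    }
}
```

A two-character word thus survives as one searchable token instead of being split into unrelated single characters.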
2. Modify FastCharStream.java
===========================================
@@ -18,6 +18,8 @@
import java.io.*;
+import org.apache.lucene.analysis.Token;
+
/** An efficient implementation of JavaCC's CharStream interface. <p>Note that
* this does not do line-number counting, but instead keeps track of the
* character position of the token in the input, as required by Lucene's
[EMAIL PROTECTED]
@@ -69,10 +71,15 @@
if (charsRead == -1)
throw new IOException("read past eof");
else
- bufferLength += charsRead;
+ {
+ charsRead = new CJKCharStream().readChineseChars(newPosition, charsRead);
+ bufferLength += charsRead;
+ }
}
- public final char BeginToken() throws IOException {
+
+
+public final char BeginToken() throws IOException {
tokenStart = bufferPosition;
return readChar();
}
@@ -117,4 +124,45 @@
public final int getBeginLine() {
return 1;
}
+
+
+ final class CJKCharStream
+ {
+
+ /**
+ * @param newPosition
+ * @param charsRead
+ * @return
+ * @throws IOException
+ */
+ int readChineseChars(int newPosition, int charsRead)
+ throws IOException
+ {
+ String str = new String(buffer,newPosition,charsRead);
+ CJKTokenizer tokenizer = new CJKTokenizer(new StringReader(str));
+ Token token = tokenizer.next();
+ StringBuffer sb = new StringBuffer();
+ while(token != null)
+ {
+ sb.append(token.termText()).append(" ");
+ token = tokenizer.next();
+ }
+
+
+
+ while(sb.length()>buffer.length-newPosition)
+ {
+ char[] newBuffer = new char[buffer.length*2];
+ System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
+ buffer = newBuffer;
+ }
+
+ for(int i=0;i<sb.length();i++){
+ buffer[newPosition+i]=sb.charAt(i);
+ }
+
+ return sb.length();
+ }
+ }
+
}
To support "<SIGRAM: (<CJK>)+>" in NutchAnalysis.jj, we do Chinese term
segmentation in FastCharStream, which runs before NutchAnalysis's parse
method. The main component is CJKTokenizer, which bi-segments Chinese terms.
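The core of readChineseChars above can be sketched without Lucene: segment the freshly read characters and write the space-separated tokens back into the buffer, doubling the buffer when the expanded text no longer fits. In this sketch, segment() is a hypothetical stand-in for CJKTokenizer, and the class name is made up:

```java
// Lucene-free sketch of the readChineseChars idea from the patch above.
public class CJKRewriteDemo {
    char[] buffer = new char[8];

    // Stand-in for CJKTokenizer: emit adjacent-character bi-grams,
    // each followed by a space, so JavaCC sees separate tokens.
    static String segment(String cjk) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < cjk.length(); i++) {
            sb.append(cjk, i, i + 2).append(' ');
        }
        return sb.toString();
    }

    // Mirrors readChineseChars: rewrites the chars at newPosition and
    // returns the new count of valid chars there.
    int rewrite(int newPosition, int charsRead) {
        String expanded = segment(new String(buffer, newPosition, charsRead));
        // Double the buffer until the expanded text fits.
        while (expanded.length() > buffer.length - newPosition) {
            char[] bigger = new char[buffer.length * 2];
            System.arraycopy(buffer, 0, bigger, 0, buffer.length);
            buffer = bigger;
        }
        expanded.getChars(0, expanded.length(), buffer, newPosition);
        return expanded.length();
    }

    public static void main(String[] args) {
        CJKRewriteDemo demo = new CJKRewriteDemo();
        "中文分词".getChars(0, 4, demo.buffer, 0);
        int n = demo.rewrite(0, 4);
        System.out.println(new String(demo.buffer, 0, n)); // 中文 文分 分词
    }
}
```

Note the expansion: four input characters become nine (three bi-grams plus separators), which is why the buffer may need to grow.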
3. Add CJKTokenizer.java
4. Modify NutchDocumentTokenizer.java
===========================================
@@ -46,8 +46,11 @@
while (true) {
t = tokenManager.getNextToken();
switch (t.kind) { // skip query syntax tokens
- case EOF: case WORD: case ACRONYM: case SIGRAM:
+ case EOF: case WORD: case ACRONYM:
break loop;
+ case SIGRAM:
+ CJKTokenizer cjkT = new CJKTokenizer(input);
+ return cjkT.next();
default:
}
}
===========================================
NutchDocumentTokenizer.tokenStream() is called by NutchDocumentAnalyzer, and
in this way the modified NutchDocumentTokenizer class lets
NutchDocumentAnalyzer support Chinese.
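The delegation in step 4 can be sketched as follows. This is not the Nutch or Lucene API; the token-kind constants and the bi-gram stand-in for CJKTokenizer are invented for illustration:

```java
// Sketch of the patched switch: SIGRAM (CJK) tokens are handed to a
// CJK-aware tokenizer, while other token kinds pass through unchanged.
public class DelegationDemo {
    static final int WORD = 0, SIGRAM = 1;

    // Stand-in for CJKTokenizer: split a CJK run into overlapping bi-grams.
    static String[] cjkTokens(String run) {
        String[] out = new String[Math.max(run.length() - 1, 0)];
        for (int i = 0; i + 1 < run.length(); i++) {
            out[i] = run.substring(i, i + 2);
        }
        return out;
    }

    static String[] tokensFor(int kind, String text) {
        if (kind == SIGRAM) {
            return cjkTokens(text);       // delegate to the CJK tokenizer
        }
        return new String[] { text };     // WORD, ACRONYM, ... unchanged
    }

    public static void main(String[] args) {
        for (String t : tokensFor(SIGRAM, "中文分词")) {
            System.out.println(t); // 中文, 文分, 分词
        }
        for (String t : tokensFor(WORD, "nutch")) {
            System.out.println(t); // nutch
        }
    }
}
```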
> Chinese in Nutch
> ----------------
>
> Key: NUTCH-36
> URL: http://issues.apache.org/jira/browse/NUTCH-36
> Project: Nutch
> Type: Improvement
> Components: indexer, searcher
> Environment: all
> Reporter: Jack Tang
> Priority: Minor
>
> Nutch now supports Chinese in a very simple way: NutchAnalysis segments CJK
> terms word-by-word.
> So, if I search for the Chinese term 'FooBar' (two Chinese characters, 'Foo'
> and 'Bar'), the web GUI highlights 'FooBar' as well as 'Foo' and 'Bar',
> while we expect Nutch to highlight only 'FooBar'.