Is it possible?
Well, in Eclipse it succeeded. I added some encoding code to Content.java using
HtmlParser (a plugin). It works in Eclipse (I have tested it using
SegmentReader only, not with any unit tests, though).
But when compiling with ant I get compile errors.
Here is the modification to Content.java in the nutch-0.9.tar.gz release
version (not trunk).
I replaced the line:
    buffer.append(new String(content)); // try default encoding
with:
    Configuration conf = NutchConfiguration.create();
    HtmlParser parser = new HtmlParser();
    parser.setConf(conf);
    Parse parse = parser.getParse(this);
    String encoding = parse.getData().getParseMeta().get("OriginalCharEncoding");
    String localEncodedString = "java incompatible encoding";
    try {
        localEncodedString = new String(content, encoding);
    } catch (Exception e) {
        e.printStackTrace();
    }
    buffer.append(localEncodedString);
Here are the compile errors:
compile-core:
[javac] Compiling 165 source files to /home/onur/nutch-0.9/build/classes
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:39: package org.apache.nutch.parse.html does not exist
[javac] import org.apache.nutch.parse.html.HtmlParser;
[javac] ^
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
[javac] symbol : class HtmlParser
[javac] location: class org.apache.nutch.protocol.Content
[javac] HtmlParser parser = new HtmlParser();
[javac] ^
[javac] /home/onur/nutch-0.9/src/java/org/apache/nutch/protocol/Content.java:240: cannot find symbol
[javac] symbol : class HtmlParser
[javac] location: class org.apache.nutch.protocol.Content
[javac] HtmlParser parser = new HtmlParser();
[javac] ^
[javac] Note: Some input files use or override a deprecated API.
[javac] Note: Recompile with -Xlint:deprecation for details.
[javac] Note: Some input files use unchecked or unsafe operations.
[javac] Note: Recompile with -Xlint:unchecked for details.
[javac] 3 errors
BUILD FAILED
/home/onur/nutch-0.9/build.xml:106: Compile failed; see the compiler error output for details.
Do I need to change any other configuration to fix this? (parse-html is
already listed in the plugin.includes property in nutch-default.xml; I also
tried adding it to nutch-site.xml, but that did not help.)
Or are plugins not intended to be used from core code?
Any ideas?
(By the way, what I am trying to do here is make the -get functionality
respect the page's encoding. It normally returns the content in the
platform-default encoding (UTF-8).)
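For reference, the decoding step on its own, independent of Nutch and of how the encoding name is obtained, can be sketched roughly as below. This is only an illustration under assumptions: the class name DecodeSketch and the method decode are hypothetical and not part of Nutch, and it assumes a JDK recent enough for multi-catch. Unlike the snippet above, it falls back to the platform-default charset when the encoding name is missing or unsupported, instead of appending a placeholder string.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper, not part of Nutch: decodes raw bytes using a charset name.
class DecodeSketch {

    // Decode content with the named charset; fall back to the platform
    // default when the name is null, malformed, or not supported by this JVM.
    public static String decode(byte[] content, String encoding) {
        Charset charset = Charset.defaultCharset();
        if (encoding != null) {
            try {
                charset = Charset.forName(encoding);
            } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
                // Unknown or malformed name: keep the platform default.
            }
        }
        return new String(content, charset);
    }

    public static void main(String[] args) {
        byte[] latin1 = {(byte) 0xE9}; // a single byte that is U+00E9 in ISO-8859-1
        System.out.println(decode(latin1, "ISO-8859-1"));
        System.out.println(decode("plain ascii".getBytes(), null));
    }
}
```

In the Content.java change above, the value read from the "OriginalCharEncoding" parse metadata would play the role of the encoding argument.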
Thanks,
Onur Deniz