I tried to compile trunk (revision 579849) and the build failed on
HtmlParser. The 4th argument to the String constructor on line 84
needs to be a String (the charset name), not a Charset. I made the
change, but I can't commit it, so here is the diff:
Index: src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
===================================================================
--- src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	(revision 579846)
+++ src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java	(working copy)
@@ -81,7 +81,12 @@
// to just inflate each byte to a 16-bit value by padding.
// For instance, the sequence {0x41, 0x82, 0xb7} will be turned into
// {U+0041, U+0082, U+00B7}.
- String str = new String(content, 0, length, Charset.forName("ASCII"));
+ String str = "";
+ try {
+ str = new String(content, 0, length, Charset.forName("ASCII").toString());
+ } catch (UnsupportedEncodingException e) {
+ e.printStackTrace();
+ }
Matcher metaMatcher = metaPattern.matcher(str);
String encoding = null;
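
For anyone who wants to try the decoding step outside of Nutch, here is a minimal standalone sketch of the same idea. It is not the patch itself: the class name AsciiSniff and the helper decodeAscii are made up for illustration, and it passes the charset name "US-ASCII" directly instead of going through Charset.forName. The String(byte[], int, int, Charset) overload only exists from Java 6 on, which is presumably why the original line does not compile on an older JDK; the overload that takes a charset name works on both.

```java
import java.io.UnsupportedEncodingException;

public class AsciiSniff {

    // Hypothetical helper (not part of Nutch): decode the first `length`
    // bytes of `content` using the charset *name* overload of String.
    static String decodeAscii(byte[] content, int length) {
        try {
            return new String(content, 0, length, "US-ASCII");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support US-ASCII,
            // so this branch should never be reached.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The bytes of "<meta>" followed by one extra byte that is
        // excluded by passing length = 6.
        byte[] content = {0x3c, 0x6d, 0x65, 0x74, 0x61, 0x3e, 0x7a};
        System.out.println(decodeAscii(content, 6)); // prints "<meta>"
    }
}
```

Rethrowing here is just one option; the patch above logs the exception and carries on with an empty string instead.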
Thanks,
Ned