Plain text parser should use parser.character.encoding.default property for
fall back encoding
----------------------------------------------------------------------------------------------
Key: NUTCH-161
URL: http://issues.apache.org/jira/browse/NUTCH-161
Project: Nutch
Type: Bug
Components: indexer
Environment: any
Reporter: KuroSaka TeruHiko
Priority: Minor
The value of the property parser.character.encoding.default is used as a
fallback character encoding (charset) when the HTML parser cannot find charset
information in the HTTP Content-Type header or in a META HTTP-EQUIV tag. The
plain text parser behaves differently: it uses the system encoding (the Java
VM's file.encoding property, which in turn derives from the OS and the locale
of the environment from which the JVM was spawned), so the same document can
be decoded differently on different machines. To guarantee consistent
behavior, the plain text parser should use the value of the same property.
Though not tested, these changes in
./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java
should do it:
Insert this statement in the class definition:
private static String defaultCharEncoding =
NutchConf.get().get("parser.character.encoding.default", "windows-1252");
Replace this:
text = new String(content.getContent()); // use default encoding
with this:
// use the configured fallback encoding
text = new String(content.getContent(), defaultCharEncoding);
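
One detail the replacement above leaves open: new String(byte[], String)
declares the checked java.io.UnsupportedEncodingException, so the call site has
to handle a misconfigured charset name. The following is a minimal standalone
sketch of that handling, not the actual TextParser code. The class name
TextDecodeSketch, the decode helper, and the choice of ISO-8859-1 as a last
resort are assumptions made for illustration; the constant merely stands in for
the value that would be read from parser.character.encoding.default.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration class; it is not part of Nutch.
final class TextDecodeSketch {

    // Stand-in for the configured parser.character.encoding.default value.
    static final String DEFAULT_CHAR_ENCODING = "windows-1252";

    // Decode raw bytes with the named charset instead of the platform default.
    static String decode(byte[] raw, String charsetName) {
        try {
            return new String(raw, charsetName);
        } catch (UnsupportedEncodingException e) {
            // Assumed last resort: ISO-8859-1 is available on every JVM
            // and maps each byte to exactly one character.
            return new String(raw, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // The bytes for "caf" followed by 0xE9 (e with acute accent in windows-1252).
        byte[] raw = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};

        // Same result on every machine, regardless of OS locale.
        System.out.println(decode(raw, DEFAULT_CHAR_ENCODING));

        // Result depends on the JVM's default charset; this is the behavior
        // the issue describes.
        System.out.println(new String(raw));
    }
}
```

Whether an unknown charset name should fall back silently, as in this sketch,
or be logged is a separate design choice.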