Plain text parser should use the parser.character.encoding.default property
for fallback encoding
----------------------------------------------------------------------------------------------

         Key: NUTCH-161
         URL: http://issues.apache.org/jira/browse/NUTCH-161
     Project: Nutch
        Type: Bug
  Components: indexer  
 Environment: any
    Reporter: KuroSaka TeruHiko
    Priority: Minor


The value of the property parser.character.encoding.default is used as the 
fallback character encoding (charset) when the HTML parser cannot find charset 
information in the HTTP Content-Type header or in a META HTTP-EQUIV tag.  The 
plain text parser behaves differently: it uses the platform default encoding 
(the JVM's file.encoding system property, which in turn derives from the OS 
and the locale of the environment from which the JVM was spawned).  As a 
result, the same document can be decoded differently depending on the machine 
on which Nutch runs.  To guarantee consistent behavior, the plain text parser 
should use the value of the same property.
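
To illustrate the problem, here is a minimal standalone sketch (not Nutch 
code; the class name CharsetDemo and the helper decode are hypothetical). It 
decodes the same single byte with the JVM default charset and with explicitly 
named charsets. The result of the first decode depends on the environment; 
the explicit ones do not.

```java
import java.nio.charset.Charset;

public class CharsetDemo {
    /** Decodes bytes with an explicitly named charset instead of the JVM default. */
    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // 0xE9 is "e with acute accent" in windows-1252 and ISO-8859-1,
        // but is not a complete character on its own in UTF-8.
        byte[] bytes = {(byte) 0xE9};

        // Environment-dependent: varies with the OS, locale, and JVM settings.
        System.out.println("JVM default charset: " + Charset.defaultCharset());
        System.out.println("default:      " + new String(bytes));

        // Environment-independent: the charset is named explicitly.
        System.out.println("windows-1252: " + decode(bytes, "windows-1252"));
        System.out.println("UTF-8:        " + decode(bytes, "UTF-8"));
    }
}
```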

Though untested, the following changes to 
./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java 
should do it.

Insert this statement in the class definition:
  private static String defaultCharEncoding =
    NutchConf.get().get("parser.character.encoding.default", "windows-1252");

Replace this:
      text = new String(content.getContent());    // use default encoding
with this:
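      // use the configured fallback encoding
      text = new String(content.getContent(), defaultCharEncoding);

Note that the String(byte[], String) constructor throws the checked 
UnsupportedEncodingException, so the call must be wrapped in a try/catch 
block or the exception must be declared.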
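
The proposed change could be sketched as follows. This is a standalone 
illustration rather than the actual TextParser code: the class 
FallbackDecoder, the method decode, and the constant HARD_DEFAULT are 
hypothetical, and the configured value is passed in as a plain String 
instead of being read through NutchConf. Falling back to ISO-8859-1 when the 
configured name is unsupported is an assumption of this sketch, not something 
the report specifies.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

public class FallbackDecoder {
    // Hypothetical hard-coded default, mirroring the "windows-1252" default above.
    static final String HARD_DEFAULT = "windows-1252";

    /**
     * Decodes content with the configured fallback encoding.
     * configuredEncoding stands in for the value of
     * parser.character.encoding.default; null or empty means "not set".
     */
    static String decode(byte[] content, String configuredEncoding) {
        String name = (configuredEncoding == null || configuredEncoding.isEmpty())
                ? HARD_DEFAULT
                : configuredEncoding;
        try {
            return new String(content, name);
        } catch (UnsupportedEncodingException e) {
            // Assumption: if the configured name is unknown to this JVM, use
            // ISO-8859-1, which every Java platform is required to support.
            return new String(content, StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        byte[] content = {(byte) 0x63, (byte) 0x61, (byte) 0x66, (byte) 0xE9};
        System.out.println(decode(content, null));
        System.out.println(decode(content, "no-such-charset"));
    }
}
```

Handling the exception inside the helper keeps a misconfigured property from 
aborting the parse, at the cost of silently using a different charset; 
logging a warning in the catch block may be preferable in practice.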

