[ https://issues.apache.org/jira/browse/NUTCH-161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sami Siren updated NUTCH-161:
-----------------------------

    Fix Version/s: 1.0.0
         Assignee: Sami Siren
          Summary: Change Plain text parser to use parser.character.encoding.default property for fall back encoding  (was: Plain text parser should use parser.character.encoding.default property for fall back encoding)

> Change Plain text parser to use parser.character.encoding.default property for fall back encoding
> -------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-161
>                 URL: https://issues.apache.org/jira/browse/NUTCH-161
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>         Environment: any
>            Reporter: KuroSaka TeruHiko
>         Assigned To: Sami Siren
>            Priority: Minor
>             Fix For: 1.0.0
>
>
> The value of the property parser.character.encoding.default is used as a
> fallback character encoding (charset) when the HTML parser cannot find the
> charset information in the HTTP Content-Type header or in a META HTTP-EQUIV
> tag.
> The plain text parser behaves differently. It just uses the system
> encoding (the Java VM's file.encoding, which in turn derives from the OS and
> the locale of the environment from which the JVM was spawned). This is not
> pretty. To guarantee consistent behavior, the plain text parser should use
> the value of the same property.
> Though not tested, these changes in
> ./src/plugin/parse-text/src/java/org/apache/nutch/parse/text/TextParser.java
> should do it:
> Insert this statement in the class definition:
>     private static String defaultCharEncoding =
>         NutchConf.get().get("parser.character.encoding.default", "windows-1252");
> Replace this:
>     text = new String(content.getContent());  // use default encoding
> with this:
>     text = new String(content.getContent(), defaultCharEncoding);  // use configured fallback encoding

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
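
The change proposed in the issue can be illustrated outside Nutch with a minimal, standalone sketch. It is not the actual TextParser patch: the class name FallbackDecodeSketch and its methods are hypothetical, and java.util.Properties stands in for the Nutch configuration object as an assumption. Only the property name and the "windows-1252" default come from the issue text. The sketch also guards against a misconfigured charset name, which the snippet in the issue does not do.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Properties;

// Hypothetical illustration; not part of Nutch.
class FallbackDecodeSketch {
    // Property name and default value as quoted in the issue.
    static final String PROP = "parser.character.encoding.default";
    static final String BUILT_IN_DEFAULT = "windows-1252";

    // Resolve the fallback charset from configuration. If the configured
    // name is malformed or not supported by this JVM, use the built-in
    // default instead of failing.
    static Charset fallbackCharset(Properties conf) {
        String name = conf.getProperty(PROP, BUILT_IN_DEFAULT);
        try {
            return Charset.forName(name);
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return Charset.forName(BUILT_IN_DEFAULT);
        }
    }

    // Decode raw content with the configured fallback rather than with
    // the platform default (which is what new String(byte[]) would use).
    static String decode(byte[] content, Properties conf) {
        return new String(content, fallbackCharset(conf));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty(PROP, "ISO-8859-1");

        byte[] content = {(byte) 0xE9}; // one byte, outside the ASCII range
        String text = decode(content, conf);

        // Print the decoded code point in hex so the output does not
        // depend on the console's own encoding.
        System.out.println(Integer.toHexString(text.charAt(0)));
    }
}
```

Because the charset is taken from configuration, the same bytes decode to the same text on every machine, which is the consistency the issue asks for.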
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers