[ https://issues.apache.org/jira/browse/NUTCH-369?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Renaud Richardet updated NUTCH-369:
-----------------------------------

    Attachment: remover.diff

Just FYI, you can further filter which elements neko should keep and which it should remove. See the patch for an example, and http://people.apache.org/~andyc/neko/doc/html/settings.html

> StringUtil.resolveEncodingAlias is unuseful.
> ---------------------------------------------
>
>                 Key: NUTCH-369
>                 URL: https://issues.apache.org/jira/browse/NUTCH-369
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.9.0
>         Environment: all
>            Reporter: King Kong
>            Priority: Minor
>         Attachments: patch.diff, remover.diff
>
>
> Even after we define an encoding alias map in StringUtil, HTML parsing still
> uses the original encoding.
> I found that nekohtml, which HtmlParser uses, reads the charset from the meta tag.
> We can set its feature
> "http://cyberneko.org/html/features/scanner/ignore-specified-charset" to true
> so that nekohtml uses the encoding we set.
> Concretely:
>   private DocumentFragment parseNeko(InputSource input) throws Exception {
>     DOMFragmentParser parser = new DOMFragmentParser();
>     // some plugins, e.g., creativecommons, need to examine html comments
>     try {
> +      parser.setFeature("http://cyberneko.org/html/features/scanner/ignore-specified-charset", true);
>       parser.setFeature("http://apache.org/xml/features/include-comments", true);
>     ....
> BTW, it must be added at the front of the try block, because the following statement
> (parser.setFeature("http://apache.org/xml/features/include-comments", true);)
> will throw an exception.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
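
For readers unfamiliar with why the alias map matters, here is a minimal sketch of the idea behind the issue. It is not Nutch's actual StringUtil code: the class EncodingAliasSketch, its single alias entry (ISO-8859-1 mapped to windows-1252), and the decode helper are all assumptions made for illustration. It only shows that decoding the same bytes with the declared charset and with the resolved alias can give different characters, which is why the resolved encoding has no effect if the parser goes back to the charset named in the page's meta tag.

```java
import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical stand-in for an encoding alias map; NOT Nutch's StringUtil.
class EncodingAliasSketch {
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        // Assumed example entry: pages labelled ISO-8859-1 are often
        // really encoded as windows-1252.
        ALIASES.put("iso-8859-1", "windows-1252");
    }

    // Returns the aliased charset name if one is defined, otherwise the
    // declared name; returns null if the result is not supported by the JVM.
    static String resolveEncodingAlias(String declared) {
        if (declared == null) {
            return null;
        }
        String alias = ALIASES.get(declared.toLowerCase(Locale.ROOT));
        String candidate = (alias != null) ? alias : declared;
        try {
            return Charset.isSupported(candidate) ? candidate : null;
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return null;
        }
    }

    // Decodes raw bytes with the named charset.
    static String decode(byte[] bytes, String encoding) {
        return new String(bytes, Charset.forName(encoding));
    }

    public static void main(String[] args) {
        byte[] raw = { (byte) 0x80 };          // the euro sign in windows-1252
        String declared = "ISO-8859-1";        // what the page's meta tag claims
        String resolved = resolveEncodingAlias(declared);

        System.out.println(resolved);                               // windows-1252
        System.out.println((int) decode(raw, declared).charAt(0));  // 128 (a control character)
        System.out.println((int) decode(raw, resolved).charAt(0));  // 8364 (U+20AC, the euro sign)
    }
}
```

If the HTML parser re-reads the charset from the meta tag, the page is decoded as in the second println rather than the third, which matches the behaviour reported in the issue.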