[ https://issues.apache.org/jira/browse/NUTCH-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sebastian Nagel updated NUTCH-2589:
-----------------------------------
    Fix Version/s: 1.15

> HTML redirections are not followed when using parse-tika
> --------------------------------------------------------
>
>                 Key: NUTCH-2589
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2589
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
>
> Html redirections using meta tags are supported in nutch. They work well when
> using parse-html to parse files. However, when using parse-tika, they are not
> detected.
> This is because of https://issues.apache.org/jira/browse/TIKA-2652
> Tika emits redirection meta tags as:
> {code:xml}
> <meta name="refresh" content="0; url=http://example.com"/>
> {code}
> whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags
> having the following format:
> {code:xml}
> <meta http-equiv="refresh" content="0; url=http://example.com">
> {code}
> The bug can be reproduced with the following nutch-site.xml:
> {code:xml}
> <?xml version="1.0"?>
> <?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
> <!-- Put site-specific property overrides in this file. -->
> <configuration>
>   <property>
>     <name>plugin.includes</name>
>     <value>protocol-http|parse-tika</value>
>   </property>
>   <property>
>     <name>http.agent.name</name>
>     <value>blah</value>
>   </property>
> </configuration>
> {code}
> fetching this url: http://www.google.com/policies/technologies/ads/
> The resulting status is {code}success(1,0){code} whereas using parse-html,
> the resulting status is {code:html}success(1,100),
> args[0]=https://policies.google.com/technologies/ads, args[1]=0{code}
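
The attribute mismatch described in the issue can be illustrated with a small
standalone sketch. This is not the code of
org.apache.nutch.parse.tika.HTMLMetaProcessor or of Tika; it uses only
Python's standard html.parser, and the names MetaRefreshFinder, find_redirects
and accept_name_attr are hypothetical, chosen for illustration. It shows that a
detector keyed on http-equiv misses the name="refresh" form, and that also
accepting the name attribute is one possible workaround (an assumption, not
necessarily the fix shipped in 1.15).

```python
import re
from html.parser import HTMLParser

# The two meta tag forms quoted in the issue description.
TIKA_FORM = '<meta name="refresh" content="0; url=http://example.com"/>'
HTML_FORM = '<meta http-equiv="refresh" content="0; url=http://example.com">'

# Matches a refresh value such as "0; url=http://example.com".
REFRESH_RE = re.compile(r"\s*(\d+)\s*;\s*url\s*=\s*(\S+)", re.IGNORECASE)


class MetaRefreshFinder(HTMLParser):
    """Collects (delay, url) pairs from refresh meta tags (illustrative only)."""

    def __init__(self, accept_name_attr):
        super().__init__()
        # Hypothetical switch: also treat name="refresh" as a redirect.
        self.accept_name_attr = accept_name_attr
        self.redirects = []

    def handle_starttag(self, tag, attrs):
        # Also invoked for self-closing tags via the default handle_startendtag.
        if tag != "meta":
            return
        attr = {key: (value or "") for key, value in attrs}
        kind = attr.get("http-equiv", "")
        if not kind and self.accept_name_attr:
            kind = attr.get("name", "")
        if kind.lower() != "refresh":
            return
        match = REFRESH_RE.match(attr.get("content", ""))
        if match:
            self.redirects.append((int(match.group(1)), match.group(2)))


def find_redirects(html, accept_name_attr=False):
    """Return the list of (delay_seconds, url) redirects found in html."""
    finder = MetaRefreshFinder(accept_name_attr)
    finder.feed(html)
    finder.close()
    return finder.redirects
```

With the default accept_name_attr=False, only HTML_FORM yields a redirect,
which mirrors the reported behaviour; passing accept_name_attr=True makes
TIKA_FORM yield the same target.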