[ https://issues.apache.org/jira/browse/NUTCH-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gerard Bouchar updated NUTCH-2589:
----------------------------------
    Description:
Html redirections using meta tags are supported in nutch. They work well when using parse-html to parse files. However, when using parse-tika, they are not detected.

This is because of https://issues.apache.org/jira/browse/TIKA-2652

Tika emits redirection meta tags as :
{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags having the following format :
{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}

The bug can be reproduced with the following nutch-site.xml:
{code:xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-tika</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>blah</value>
  </property>
</configuration>
{code}
fetching this url: http://www.google.com/policies/technologies/ads/

The resulting status is {code}success(1,0){code} whereas using parse-html, the resulting status is {code:html}success(1,100), args[0]=https://policies.google.com/technologies/ads, args[1]=0{code}

  was:
Html redirections using meta tags are supported in nutch. They work well when using parse-html to parse files. However, when using parse-tika, they are not detected.
This is because of https://issues.apache.org/jira/browse/TIKA-2652

Tika emits redirection meta tags as :
{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags having the following format :
{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}


> HTML redirections are not followed when using parse-tika
> --------------------------------------------------------
>
>                 Key: NUTCH-2589
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2589
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Gerard Bouchar
>            Priority: Major
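
The mismatch described in this issue (a refresh directive carried in a {{name}} attribute rather than {{http-equiv}}) can be illustrated with a small standalone sketch. This is not the Nutch or Tika code and not a proposed patch: the class {{MetaRefreshSketch}}, its {{Refresh}} type, and the method {{fromMetaAttributes}} are hypothetical names, and the sketch assumes the caller has already collected the meta tag's attributes into a map with lower-cased keys. It only shows that a check accepting either attribute would yield the same redirect target for both forms quoted in the report.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper for illustration only; not part of Nutch or Tika.
public class MetaRefreshSketch {

    /** Parsed refresh directive, e.g. "0; url=http://example.com". */
    public static final class Refresh {
        public final int delaySeconds;
        public final String url; // null when the directive carries no URL

        Refresh(int delaySeconds, String url) {
            this.delaySeconds = delaySeconds;
            this.url = url;
        }
    }

    /**
     * Assumption: attribute names in the map are already lower-cased.
     * Accepts http-equiv="refresh" as well as name="refresh".
     */
    public static Optional<Refresh> fromMetaAttributes(Map<String, String> attrs) {
        boolean isRefresh = "refresh".equalsIgnoreCase(attrs.get("http-equiv"))
                || "refresh".equalsIgnoreCase(attrs.get("name"));
        String content = attrs.get("content");
        if (!isRefresh || content == null) {
            return Optional.empty();
        }

        // Split "<delay>; url=<target>" at the first semicolon.
        int semi = content.indexOf(';');
        String delayPart = (semi < 0 ? content : content.substring(0, semi)).trim();
        int delay;
        try {
            delay = Integer.parseInt(delayPart);
        } catch (NumberFormatException e) {
            return Optional.empty(); // malformed delay: treat as no redirect
        }

        String url = null;
        if (semi >= 0) {
            String rest = content.substring(semi + 1).trim();
            if (rest.toLowerCase(Locale.ROOT).startsWith("url=")) {
                rest = rest.substring(4).trim();
            }
            if (!rest.isEmpty()) {
                url = rest;
            }
        }
        return Optional.of(new Refresh(delay, url));
    }

    public static void main(String[] args) {
        // The two attribute forms quoted in the issue description.
        Map<String, String> tikaStyle =
                Map.of("name", "refresh", "content", "0; url=http://example.com");
        Map<String, String> htmlStyle =
                Map.of("http-equiv", "refresh", "content", "0; url=http://example.com");

        System.out.println(fromMetaAttributes(tikaStyle).map(r -> r.url).orElse("none"));
        System.out.println(fromMetaAttributes(htmlStyle).map(r -> r.url).orElse("none"));
    }
}
```

In this sketch both maps resolve to the same target URL. A check that looks only at {{http-equiv}} would return nothing for the first map, which matches the behaviour reported above for parse-tika.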