[
https://issues.apache.org/jira/browse/NUTCH-2589?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Gerard Bouchar updated NUTCH-2589:
----------------------------------
Description:
HTML redirections using meta tags are supported in Nutch. They work well when
using parse-html to parse files. However, when using parse-tika, they are not
detected.
This is caused by https://issues.apache.org/jira/browse/TIKA-2652.
Tika emits redirection meta tags as:
{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in
the following format:
{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}
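The mismatch above comes down to which attribute carries the value "refresh". As a minimal sketch only, assuming a DOM-based check, the following shows a matcher that accepts either attribute. The class RefreshMetaMatcher and its methods are hypothetical names for illustration and are not part of Nutch or Tika:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

/** Hypothetical helper (not Nutch code): recognises a refresh meta tag in either form. */
class RefreshMetaMatcher {

  /** True if the element is a meta tag whose http-equiv or name attribute is "refresh". */
  static boolean isRefreshMeta(Element el) {
    if (!"meta".equalsIgnoreCase(el.getNodeName())) {
      return false;
    }
    // Element.getAttribute returns an empty string when the attribute is absent.
    String httpEquiv = el.getAttribute("http-equiv").trim();
    String name = el.getAttribute("name").trim();
    return "refresh".equalsIgnoreCase(httpEquiv) || "refresh".equalsIgnoreCase(name);
  }

  /** Builds a standalone meta element with one identifying attribute, for illustration. */
  static Element meta(String attr, String value, String content) {
    try {
      Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
      Element el = doc.createElement("meta");
      el.setAttribute(attr, value);
      el.setAttribute("content", content);
      return el;
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
  }
}
```

With this check, both the form emitted by Tika (name="refresh") and the form expected by HTMLMetaProcessor (http-equiv="refresh") would be treated as a redirection candidate. Whether accepting the name attribute is the right fix on the Nutch side, rather than waiting for TIKA-2652, is an open design question.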
The bug can be reproduced with the following nutch-site.xml:
{code:xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-tika</value>
  </property>
  <property>
    <name>http.agent.name</name>
    <value>blah</value>
  </property>
</configuration>
{code}
and by fetching this URL: http://www.google.com/policies/technologies/ads/
With parse-tika, the resulting status is {code}success(1,0){code} whereas with
parse-html, the resulting status is {code:html}success(1,100),
args[0]=https://policies.google.com/technologies/ads, args[1]=0{code}
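In the parse-html status, the two arguments look like the redirect target and the refresh delay taken from the content attribute of the meta tag. As a rough sketch of how such a value could be split, assuming the common "delay; url=target" layout, the class RefreshContent below is a hypothetical illustration and not the parsing code used by Nutch:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical sketch (not Nutch code): splits a refresh content value into delay and target. */
class RefreshContent {
  // A delay in seconds, optionally followed by "; url=..." with optional quotes around the URL.
  private static final Pattern REFRESH = Pattern.compile(
      "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*url\\s*=\\s*['\"]?([^'\"\\s]+)['\"]?)?\\s*$",
      Pattern.CASE_INSENSITIVE);

  final int delaySeconds;
  final String url; // null when the tag only reloads the current page

  private RefreshContent(int delaySeconds, String url) {
    this.delaySeconds = delaySeconds;
    this.url = url;
  }

  /** Returns the parsed value, or null if the content does not match the assumed layout. */
  static RefreshContent parse(String content) {
    Matcher m = REFRESH.matcher(content);
    if (!m.matches()) {
      return null;
    }
    return new RefreshContent(Integer.parseInt(m.group(1)), m.group(2));
  }
}
```

For the example tag, RefreshContent.parse("0; url=http://example.com") yields a delay of 0 and the target http://example.com, which corresponds to the shape of args[0] and args[1] reported by parse-html.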
was:
HTML redirections using meta tags are supported in Nutch. They work well when
using parse-html to parse files. However, when using parse-tika, they are not
detected.
This is caused by https://issues.apache.org/jira/browse/TIKA-2652.
Tika emits redirection meta tags as:
{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}
whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in
the following format:
{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}
> HTML redirections are not followed when using parse-tika
> --------------------------------------------------------
>
> Key: NUTCH-2589
> URL: https://issues.apache.org/jira/browse/NUTCH-2589
> Project: Nutch
> Issue Type: Bug
> Reporter: Gerard Bouchar
> Priority: Major