Gerard Bouchar created NUTCH-2589:
-------------------------------------

             Summary: HTML redirections are not followed when using parse-tika
                 Key: NUTCH-2589
                 URL: https://issues.apache.org/jira/browse/NUTCH-2589
             Project: Nutch
          Issue Type: Bug
         Environment: nutch-site.xml:
{code:xml}
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>
    <property>
        <name>plugin.includes</name>
        <value>protocol-http|parse-tika</value>
    </property>
    <property>
        <name>http.agent.name</name>
        <value>blah</value>
    </property>
</configuration>
{code}

fetched url: https://policies.google.com/technologies/ads
            Reporter: Gerard Bouchar


HTML redirections using meta tags are supported in Nutch. They work well when 
using parse-html to parse files. However, when using parse-tika, they are not 
detected.

This is because of https://issues.apache.org/jira/browse/TIKA-2652

Tika emits redirection meta tags as:

{code:xml}
<meta name="refresh" content="0; url=http://example.com"/>
{code}

whereas org.apache.nutch.parse.tika.HTMLMetaProcessor expects meta tags in 
the following format:

{code:xml}
<meta http-equiv="refresh" content="0; url=http://example.com">
{code}
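
To illustrate the mismatch between the two formats above, here is a minimal, standalone sketch of a lookup that accepts the refresh keyword in either the http-equiv or the name attribute. It is not the actual HTMLMetaProcessor code and not a proposed patch: the class MetaRefreshSketch and its methods findRefreshUrl and parseUrl are hypothetical names, and the sketch assumes the parser output is well-formed XHTML that the JDK DOM parser can read.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, for illustration only; not part of Nutch or Tika.
final class MetaRefreshSketch {

    /** Returns the target URL of the first meta refresh tag, or null if none is found. */
    static String findRefreshUrl(String xhtml) {
        Document doc;
        try {
            DocumentBuilder builder =
                DocumentBuilderFactory.newInstance().newDocumentBuilder();
            doc = builder.parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        NodeList metas = doc.getElementsByTagName("meta");
        for (int i = 0; i < metas.getLength(); i++) {
            Element meta = (Element) metas.item(i);
            // getAttribute returns "" when the attribute is absent.
            // Assumption: treat name="refresh" (as emitted by Tika, see TIKA-2652)
            // the same way as http-equiv="refresh".
            boolean isRefresh =
                "refresh".equalsIgnoreCase(meta.getAttribute("http-equiv"))
                    || "refresh".equalsIgnoreCase(meta.getAttribute("name"));
            if (!isRefresh) {
                continue;
            }
            String url = parseUrl(meta.getAttribute("content"));
            if (url != null) {
                return url;
            }
        }
        return null;
    }

    /** Extracts the URL from a content value such as "0; url=http://example.com". */
    static String parseUrl(String content) {
        int semi = content.indexOf(';');
        if (semi < 0) {
            return null; // delay only, no redirect target
        }
        String rest = content.substring(semi + 1).trim();
        if (!rest.regionMatches(true, 0, "url=", 0, 4)) {
            return null;
        }
        String url = rest.substring(4).trim();
        return url.isEmpty() ? null : url;
    }

    public static void main(String[] args) {
        String tikaStyle = "<html><head>"
            + "<meta name=\"refresh\" content=\"0; url=http://example.com\"/>"
            + "</head><body/></html>";
        System.out.println(findRefreshUrl(tikaStyle));
    }
}
```

Accepting the name attribute is only one possible workaround on the Nutch side; the alternative is to wait for a fix of TIKA-2652 so that Tika emits http-equiv itself.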



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
