[jira] [Commented] (NUTCH-2319) Link with "rel=alternate" doesn't return in crawl

Sebastian Nagel (JIRA) Fri, 07 Oct 2016 05:44:54 -0700

    [ 
https://issues.apache.org/jira/browse/NUTCH-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554993#comment-15554993
 ]


Sebastian Nagel commented on NUTCH-2319:
----------------------------------------

See the ongoing discussion on user@nutch: [Issue Crawling Alternate 
URLs|http://www.mail-archive.com/user%40nutch.apache.org/msg14978.html]. 
As far as I can see, this is not a problem in Nutch: the server sends either 
an RSS feed or an HTML page, depending on how or from where the HTTP request 
was sent (e.g. based on the User-Agent or the Accept request header). When 
Nutch parses the HTML page as saved from a browser, the link is found as expected.

{noformat}
% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http -verbose 
http://rssfeeds.azcentral.com/phoenix/asu
robots.txt whitelist not configured.
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" 
href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss 
xmlns:content="http://purl.org/rss/1.0/modules/content/"  version="2.0" 
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
  <channel>
    <title>Phoenix - ASU</title>
    <link>http://api-internal.usatoday.com.akadns.net</link>
{noformat}
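
To check the content negotiation outside of Nutch, the same URL can be requested with different request headers and the returned Content-Type compared. The following is only a minimal sketch using the Python standard library: the two header profiles are illustrative assumptions (they are not the exact headers Nutch or any browser sends), and whether the server still responds this way is not verified here.

```python
import urllib.request

URL = "http://rssfeeds.azcentral.com/phoenix/asu"

# Hypothetical header profiles, chosen only for illustration.
PROFILES = {
    "crawler-like": {"User-Agent": "example-crawler/0.1", "Accept": "*/*"},
    "browser-like": {
        "User-Agent": "Mozilla/5.0",
        "Accept": "text/html,application/xhtml+xml",
    },
}


def build_request(url, headers):
    """Create a GET request carrying the given request headers."""
    return urllib.request.Request(url, headers=headers)


def fetch_content_type(url, headers, timeout=10):
    """Send the request and return the media type from the Content-Type header."""
    with urllib.request.urlopen(build_request(url, headers), timeout=timeout) as resp:
        return resp.headers.get_content_type()


def compare(url=URL):
    """Print the Content-Type the server returns for each header profile."""
    for name, headers in PROFILES.items():
        print(name, "->", fetch_content_type(url, headers))
```

Calling `compare()` performs the two requests. If the server negotiates on these headers, the two lines should show different media types (e.g. application/rss+xml versus text/html).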

After saving the page from the browser and serving it from a local web server as http://localhost/nutch/PhoenixASU.html:
{noformat}
% bin/nutch parsechecker http://localhost/nutch/PhoenixASU.html
...
Status: success(1,0)
Title: Phoenix - ASU
Outlinks: 342
  outlink: toUrl: http://localhost/nutch/PhoenixASU_files/fb4styles.css anchor: 
  outlink: toUrl: http://rssfeeds.azcentral.com/phoenix/asu&x=1 anchor: 
  outlink: toUrl: http://localhost/nutch/PhoenixASU_files/fb4styles.css anchor: 
  outlink: toUrl: http://localhost/nutch/PhoenixASU_files/urchin.js anchor: 
...
{noformat}
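
For illustration, the link in question can also be extracted from an HTML snippet independently of Nutch. This is a minimal sketch with Python's standard `html.parser`, not the parser Nutch uses; the class and function names are hypothetical, and the sample markup is the link element quoted in the issue description. It also shows why the outlink above reads `&x=1`: the `&amp;` entity in the attribute value is decoded during parsing.

```python
from html.parser import HTMLParser


class AlternateLinkParser(HTMLParser):
    """Collect href values of <link> elements whose rel contains 'alternate'."""

    def __init__(self):
        super().__init__()
        self.alternates = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated token list; compare case-insensitively.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_tokens and attr.get("href"):
            self.alternates.append(attr["href"])


def find_alternate_links(html):
    """Return the hrefs of all rel="alternate" links found in the HTML text."""
    parser = AlternateLinkParser()
    parser.feed(html)
    parser.close()
    return parser.alternates


SAMPLE = (
    '<html><head><link rel="alternate" type="application/atom+xml" '
    'href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1" '
    'title="Phoenix - ASU"></head><body></body></html>'
)

print(find_alternate_links(SAMPLE))
# prints ['http://rssfeeds.azcentral.com/phoenix/asu&x=1']
```

An RSS document such as the one returned to Nutch above contains no such HTML link element, so there is nothing of this kind to extract from it.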

> Link with "rel=alternate" doesn't return in crawl 
> --------------------------------------------------
>
>                 Key: NUTCH-2319
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2319
>             Project: Nutch
>          Issue Type: Bug
>            Reporter: Zuber
>
> I am using nutch-1.4. Nutch does not return the URLs from link elements 
> with rel="alternate".
> For example, I am trying to crawl the URL 
> http://rssfeeds.azcentral.com/phoenix/asu, which contains the link below, 
> but the link does not appear in the results.
> <link rel="alternate" type="application/atom+xml" 
> href="http://rssfeeds.azcentral.com/phoenix/asu&amp;x=1" title="Phoenix - 
> ASU">
> Could you please help



