[
https://issues.apache.org/jira/browse/NUTCH-2319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15554993#comment-15554993
]
Sebastian Nagel commented on NUTCH-2319:
----------------------------------------
See the ongoing discussion in user@nutch [Issue Crawling Alternate
URLs|http://www.mail-archive.com/user%40nutch.apache.org/msg14978.html].
As far as I can see, that's not a problem: the server sends an RSS feed or an
HTML page depending on how or from where the HTTP request was sent (e.g. based
on the User-Agent or the Accept request header). When Nutch parses the HTML
page saved from the browser, the link is found as expected.
{noformat}
% bin/nutch plugin protocol-http org.apache.nutch.protocol.http.Http -verbose http://rssfeeds.azcentral.com/phoenix/asu
robots.txt whitelist not configured.
Status: success(1), lastModified=0
Content Type: application/rss+xml
Content Length: null
Content:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl"
href="http://rssfeeds.azcentral.com/feedblitz_rss.xslt"?><rss
xmlns:content="http://purl.org/rss/1.0/modules/content/" version="2.0"
xmlns:feedburner="http://rssnamespace.org/feedburner/ext/1.0">
<channel>
<title>Phoenix - ASU</title>
<link>http://api-internal.usatoday.com.akadns.net</link>
{noformat}
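The header-dependent behaviour described above could be probed outside Nutch with a small script. This is only a sketch in Python, not Nutch code: the helper names `build_request` and `fetch_content_type` and both header sets are invented for illustration, and which header the server actually keys on has not been verified here.

```python
import urllib.request

FEED_URL = "http://rssfeeds.azcentral.com/phoenix/asu"

# Hypothetical header sets; real browsers and real Nutch configurations
# send different (and longer) values.
BROWSER_LIKE = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)",
    "Accept": "text/html,application/xhtml+xml",
}
CRAWLER_LIKE = {
    "User-Agent": "example-crawler/0.1",
    "Accept": "*/*",
}


def build_request(url, headers):
    """Create a GET request carrying the given request headers."""
    return urllib.request.Request(url, headers=headers)


def fetch_content_type(request, timeout=10):
    """Send the request and return the response's Content-Type header."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.headers.get("Content-Type")
```

Calling `fetch_content_type(build_request(FEED_URL, BROWSER_LIKE))` and then the same with `CRAWLER_LIKE` would show whether the two header sets get different content types back from the server.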
Save the page to a local web server as http://localhost/nutch/PhoenixASU.html:
{noformat}
% bin/nutch parsechecker http://localhost/nutch/PhoenixASU.html
...
Status: success(1,0)
Title: Phoenix - ASU
Outlinks: 342
outlink: toUrl: http://localhost/nutch/PhoenixASU_files/fb4styles.css anchor:
outlink: toUrl: http://rssfeeds.azcentral.com/phoenix/asu&x=1 anchor:
outlink: toUrl: http://localhost/nutch/PhoenixASU_files/fb4styles.css anchor:
outlink: toUrl: http://localhost/nutch/PhoenixASU_files/urchin.js anchor:
...
{noformat}
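To illustrate why the link only shows up for the HTML variant, here is a minimal sketch in Python (again, not Nutch's parser) that collects the `href` of every `<link rel="alternate">` element. The class name `AlternateLinkCollector`, the function `alternate_links` and the two sample strings are made up for illustration; the `<link>` element in the HTML sample is the one quoted in the issue description, and the RSS sample is a shortened version of the protocol-http output above.

```python
from html.parser import HTMLParser


class AlternateLinkCollector(HTMLParser):
    """Collect href values of <link rel="alternate"> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = dict(attrs)
        # rel is a space-separated token list; attribute values may be None.
        rel_tokens = (attr.get("rel") or "").lower().split()
        if "alternate" in rel_tokens and attr.get("href"):
            self.hrefs.append(attr["href"])


def alternate_links(markup):
    """Return the hrefs of all rel="alternate" links found in the markup."""
    collector = AlternateLinkCollector()
    collector.feed(markup)
    collector.close()
    return collector.hrefs


# HTML variant: contains the <link> element quoted in the issue.
HTML_SAMPLE = (
    '<html><head><title>Phoenix - ASU</title>'
    '<link rel="alternate" type="application/atom+xml" '
    'href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix - ASU">'
    '</head><body></body></html>'
)

# RSS variant: its <link> element has text content but no rel/href attributes.
RSS_SAMPLE = (
    '<?xml version="1.0"?><rss version="2.0"><channel>'
    '<title>Phoenix - ASU</title>'
    '<link>http://api-internal.usatoday.com.akadns.net</link>'
    '</channel></rss>'
)
```

With these samples, `alternate_links(HTML_SAMPLE)` yields the alternate URL while `alternate_links(RSS_SAMPLE)` yields an empty list, which matches the observation that the link can only be found when the HTML page is what gets parsed.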
> Link with "rel=alternate" doesn't return in crawl
> --------------------------------------------------
>
> Key: NUTCH-2319
> URL: https://issues.apache.org/jira/browse/NUTCH-2319
> Project: Nutch
> Issue Type: Bug
> Reporter: Zuber
>
> I am using nutch-1.4. The problem is that Nutch doesn't return
> the URLs from link rel="alternate" elements.
> For example, I am trying to crawl the URL
> http://rssfeeds.azcentral.com/phoenix/asu, which contains the link below,
> but I am not getting that link in the results.
> <link rel="alternate" type="application/atom+xml"
> href="http://rssfeeds.azcentral.com/phoenix/asu&x=1" title="Phoenix -
> ASU">
> Could you please help