Thanks Eelco.
I applied the patch that you suggested. It works perfectly when the
redirect returns a new URL.
However, when the redirect does not yield a new URL (no URL, or the same
URL), the status of the page is not updated. Wouldn't it be better to set
the status to STATUS_FETCH_GONE in that case? Something like below.
Fetcher.java:
...
if (newUrl != null && !newUrl.equals(url.toString())) {
  output(url, datum, null, CrawlDatum.STATUS_FETCH_SUCCESS);
  url = new UTF8(newUrl);
  redirecting = true;
  redirectCount++;
  if (LOG.isDebugEnabled()) {
    LOG.debug(" - protocol redirect to " + url);
  }
} else {
  // No usable redirect target: record a status regardless of the log level.
  output(url, datum, null, CrawlDatum.STATUS_FETCH_GONE);
  if (LOG.isDebugEnabled()) {
    LOG.debug(" - protocol redirect skipped: " +
        (newUrl != null ? "to same url" : "filtered"));
  }
}
...
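
To make the intended behaviour easy to check outside of Nutch, here is a
minimal standalone sketch of just the decision: which status the redirecting
page should get. Everything in it is hypothetical (the class name, the enum,
the method name); the enum values only stand in for the
CrawlDatum.STATUS_FETCH_* constants, and it does not touch the real Fetcher.

```java
// Hypothetical sketch; none of these names come from Nutch itself.
final class RedirectStatusSketch {

    // Stand-ins for the CrawlDatum.STATUS_FETCH_* constants used in Fetcher.java.
    enum Status { FETCH_SUCCESS, FETCH_GONE }

    // Decides which status to record for the page that issued the redirect.
    // newUrl is null when the redirect target is missing or was filtered out.
    static Status statusForRedirectSource(String url, String newUrl) {
        boolean hasNewTarget = newUrl != null && !newUrl.equals(url);
        return hasNewTarget ? Status.FETCH_SUCCESS : Status.FETCH_GONE;
    }

    public static void main(String[] args) {
        String pageA = "http://example.com/a";
        // Redirect to a different page.
        System.out.println(statusForRedirectSource(pageA, "http://example.com/b"));
        // Redirect to itself.
        System.out.println(statusForRedirectSource(pageA, pageA));
        // No target, or the target was filtered.
        System.out.println(statusForRedirectSource(pageA, null));
    }
}
```

The point is that the status decision does not depend on the log level, so
pageA always leaves the unfetched state in one of the two ways.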
Mathijs
Eelco Lempsink wrote:
> On 13-jan-2007, at 14:34, Mathijs Homminga wrote:
>> I'm using nutch 0.8.1 and I noticed the following.
>> When pageA redirects to pageB (HTTP 3xx), pageA remains unfetched in
>> the crawlDB (pageB is fetched).
>>
>> Hence, pageA shows up in each generate/fetch/updatedb iteration.
>>
>> Is this a bug? I found a previous thread on this list which describes
>> this issue too:
>> http://www.mail-archive.com/[email protected]/msg04599.html
>
> Yes. See http://issues.apache.org/jira/browse/NUTCH-273
>
> --Regards,
>
> Eelco Lempsink
>
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general