[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Andrzej Bialecki (JIRA) Thu, 20 Jul 2006 03:08:31 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422383 ]

Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------


It's true that redirected pages are fetched, but it's also true that the 
intermediate pages (the ones that we were redirected from) are discarded. 
Please see the logic in Fetcher.FetcherThread.run(): there is no call to 
output() in that case; we just proceed to fetch the page we were redirected to.

Re: ProtocolStatus: if we decide to store the intermediate redirected pages, 
then ProtocolStatus will be stored under each intermediate URL, so there is no 
need to add it explicitly to ProtocolStatus. Also, in case of redirects, the 
URL we were redirected to is already stored in ProtocolStatus (or ParseStatus).
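
To illustrate what storing the intermediate pages could look like, here is a minimal sketch of a redirect-following loop that records one entry per URL visited, each with its own status. It is a toy model, not the Fetcher code: the class RedirectChainSketch, its Status enum, the Result and Stored types, the in-memory "web" map and the maxRedirects parameter are all hypothetical, and the out.add() call merely stands in for a call to output().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical toy model; none of these types exist in Nutch.
final class RedirectChainSketch {
    enum Status { SUCCESS, MOVED, TEMP_MOVED }

    /** Simulated fetch result: a status plus the redirect target, if any. */
    static final class Result {
        final Status status;
        final String redirectTo; // null unless the status is a redirect
        Result(Status status, String redirectTo) {
            this.status = status;
            this.redirectTo = redirectTo;
        }
    }

    /** One stored record: the URL that was fetched and the status it returned. */
    static final class Stored {
        final String url;
        final Status status;
        Stored(String url, Status status) {
            this.url = url;
            this.status = status;
        }
    }

    /**
     * Follows redirects starting at startUrl and stores an entry for every
     * URL visited, including the intermediate ones, instead of keeping only
     * the final page.
     */
    static List<Stored> fetchAndStore(String startUrl,
                                      Map<String, Result> web,
                                      int maxRedirects) {
        List<Stored> out = new ArrayList<>();
        String url = startUrl;
        for (int hops = 0; url != null && hops <= maxRedirects; hops++) {
            Result r = web.get(url);          // simulated protocol-level fetch
            if (r == null) {
                break;                        // URL unknown to this toy model
            }
            out.add(new Stored(url, r.status)); // stands in for output()
            // Continue only while we keep getting redirected.
            url = (r.status == Status.SUCCESS) ? null : r.redirectTo;
        }
        return out;
    }
}
```

With this shape, the status of each hop is kept under the URL that produced it, so nothing extra needs to be attached to the final page to reconstruct the chain.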

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
> contains important information, such as protocol-level response code, 
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
> addition, if ProtocolStatus contains a valid lastModified time, that 
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages 
> is silently discarded. When Fetcher translates from protocol-level status to 
> crawldb-level status it should probably store such pages with the following 
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
> indicates a transient change, so we probably shouldn't mark the initial URL 
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
> permanent change, so the initial URL is no longer valid, i.e. it will always 
> result in redirects.
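
The status translation proposed in the issue description above can be sketched roughly as follows. This is only an illustration under stated assumptions: the enums ProtoStatus and DbStatus are hypothetical stand-ins for the ProtocolStatus codes and the CrawlDatum.STATUS_DB_* values, and the SUCCESS -> DB_FETCHED case is my assumption, added just to make the mapping total.

```java
// Hypothetical stand-ins; the real codes live in ProtocolStatus and CrawlDatum.
final class RedirectStatusSketch {
    enum ProtoStatus { SUCCESS, MOVED, TEMP_MOVED }
    enum DbStatus { DB_FETCHED, DB_RETRY, DB_GONE }

    /** Translates a protocol-level status into a crawldb-level status. */
    static DbStatus toDbStatus(ProtoStatus status) {
        switch (status) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL and retry it later.
                return DbStatus.DB_RETRY;
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return DbStatus.DB_GONE;
            default:
                // Assumption for this sketch: anything else counts as fetched.
                return DbStatus.DB_FETCHED;
        }
    }
}
```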

