[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Andrzej Bialecki (JIRA) Mon, 24 Jul 2006 16:06:33 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12423187 ]

Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------


Good questions ... ;)

ad 1: Google shows only the final page, and you can access it through both the 
original (starting) url and the final redirected url. You can't view the 
intermediate pages.

To be Google-compatible we should index only the final page, but make it 
reachable under both URLs. This is relatively easy to implement in Fetcher and 
index-basic: mark the starting and intermediate pages appropriately, skip any 
non-final pages during indexing, and add the original URL to the final page's 
entry when indexing the final page.
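
A minimal sketch of that indexing rule, in plain Java rather than actual 
Fetcher or index-basic code. FetchedPage, RedirectAwareIndexer, buildIndex and 
MAX_HOPS are hypothetical names made up for illustration; the real change would 
work on CrawlDatum and the indexing filters instead.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a fetched page; not a Nutch class.
class FetchedPage {
    final String url;
    final String redirectTo; // null when this page is the final one

    FetchedPage(String url, String redirectTo) {
        this.url = url;
        this.redirectTo = redirectTo;
    }

    boolean isFinal() {
        return redirectTo == null;
    }
}

class RedirectAwareIndexer {
    // Assumed guard against redirect loops; the value is arbitrary.
    static final int MAX_HOPS = 10;

    // Maps each final URL to the list of URLs it should be reachable under.
    static Map<String, List<String>> buildIndex(List<FetchedPage> pages) {
        Map<String, FetchedPage> byUrl = new LinkedHashMap<>();
        Set<String> redirectTargets = new HashSet<>();
        for (FetchedPage p : pages) {
            byUrl.put(p.url, p);
            if (!p.isFinal()) {
                redirectTargets.add(p.redirectTo);
            }
        }

        // Only final pages get an index entry, keyed by their own URL.
        Map<String, List<String>> index = new LinkedHashMap<>();
        for (FetchedPage p : pages) {
            if (p.isFinal()) {
                List<String> urls = new ArrayList<>();
                urls.add(p.url);
                index.put(p.url, urls);
            }
        }

        // A starting page redirects but is not itself a redirect target.
        // Intermediate pages (redirecting and redirected to) are skipped.
        for (FetchedPage p : pages) {
            if (p.isFinal() || redirectTargets.contains(p.url)) {
                continue;
            }
            FetchedPage cur = p;
            int hops = 0;
            while (cur != null && !cur.isFinal() && hops < MAX_HOPS) {
                cur = byUrl.get(cur.redirectTo);
                hops++;
            }
            if (cur != null && cur.isFinal()) {
                index.get(cur.url).add(p.url); // add the original URL
            }
        }
        return index;
    }
}
```

With a chain a -> b -> c, the entry for c lists both c and a, and b never 
appears in the index at all.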

Also, I think that if the redirect's refresh delay is large (e.g. longer than 
20 seconds) we should consider the pages to be separate, and treat them 
separately.
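
As a rough illustration of such a cutoff, assuming the delay comes from a 
meta refresh value of the form "5; url=http://example.com/": RefreshPolicy and 
its methods are hypothetical names, and the 20 second limit is only the example 
figure mentioned above, not a tested setting.

```java
import java.util.Locale;

// Hypothetical helper; not part of Nutch.
class RefreshPolicy {
    // Example threshold taken from the comment above; an assumption.
    static final int MAX_REDIRECT_DELAY_SECONDS = 20;

    // Returns the leading whole-second delay of a refresh value,
    // or -1 when it cannot be parsed as an integer.
    static int parseDelaySeconds(String content) {
        String head = content.split("[;,]", 2)[0].trim();
        try {
            return Integer.parseInt(head);
        } catch (NumberFormatException e) {
            return -1;
        }
    }

    // True when the refresh should be treated as a redirect; false when
    // the two pages should be treated as separate pages.
    static boolean isRedirect(String content) {
        int delay = parseDelaySeconds(content);
        boolean hasTarget = content.toLowerCase(Locale.ROOT).contains("url=");
        return hasTarget && delay >= 0 && delay <= MAX_REDIRECT_DELAY_SECONDS;
    }
}
```

Fractional delays are not handled in this sketch; they fail to parse and the 
pages are then treated as separate.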

ad 2: Google shows only inlinks going to the final url. However, the same 
inlinks can be obtained by using either the starting or the final url. OTOH MSN 
has separate inlinks in each case. I'm not sure yet how we should implement 
this...

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
> contains important information, such as protocol-level response code, 
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
> addition, if ProtocolStatus contains a valid lastModified time, that 
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages 
> is silently discarded. When Fetcher translates from protocol-level status to 
> crawldb-level status it should probably store such pages with the following 
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
> indicates a transient change, so we probably shouldn't mark the initial URL 
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
> permanent change, so the initial URL is no longer valid, i.e. it will always 
> result in redirects.
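
The translation proposed in the issue description above could be sketched 
roughly as follows. The enums here are local stand-ins that only borrow the 
constant names from the description; they are not the real ProtocolStatus or 
CrawlDatum classes, and RedirectStatusTranslator is a hypothetical name.

```java
import java.util.Optional;

// Local stand-ins for illustration only; not the Nutch classes.
enum ProtoStatus { SUCCESS, MOVED, TEMP_MOVED }

enum DbStatus { STATUS_DB_RETRY, STATUS_DB_GONE }

class RedirectStatusTranslator {
    // Maps a redirect status to the proposed crawldb status.
    // Returns empty for statuses the proposal does not cover.
    static Optional<DbStatus> translate(ProtoStatus status) {
        switch (status) {
            case TEMP_MOVED:
                // Transient change: keep the initial URL and retry later.
                return Optional.of(DbStatus.STATUS_DB_RETRY);
            case MOVED:
                // Permanent change: the initial URL will always redirect.
                return Optional.of(DbStatus.STATUS_DB_GONE);
            default:
                return Optional.empty();
        }
    }
}
```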

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
