[jira] Commented: (NUTCH-322) Fetcher discards ProtocolStatus, doesn't store redirected pages

Andrzej Bialecki (JIRA) Thu, 20 Jul 2006 15:11:04 -0700

    [ http://issues.apache.org/jira/browse/NUTCH-322?page=comments#action_12422508 ]

Andrzej Bialecki  commented on NUTCH-322:
-----------------------------------------


I hope I don't come across as arguing ... I'm just trying to explain the 
rationale for this. Redirected pages often have content - take a look at e.g. 
http://dmoz.org/Arts (notice the missing trailing slash). I agree that most of 
the time this content is trivial, but we always read it anyway. In some cases 
it's not the content but the metadata (HTTP headers) that is important - start 
a protocol analyzer and look at what happens when you try to visit 
http://www.svd.se/annonsera (again, no slash at the end) - the second redirect 
also sets a cookie, which may be important for further requests in this 
session. And Nutch stores metadata only when it stores the content ...

There is also the case of content-level redirection, caused by <meta 
http-equiv="refresh" ...>, where you most likely get a full page of content 
and are then redirected to another page after a delay. The redirect may happen 
immediately, but it may also come after 120 seconds - so the intermediate 
content does matter in this case.
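
To make the delay point concrete, here is a minimal sketch of reading the 
delay and target out of a meta refresh content value. It is not Nutch code: 
the class MetaRefreshSketch, its Refresh holder and the regular expression are 
all hypothetical, and the pattern only covers the common "seconds; url=target" 
form.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: parses the value of the content
// attribute of <meta http-equiv="refresh" content="...">.
final class MetaRefreshSketch {

    // Assumed shape: "<seconds>" optionally followed by "; url=<target>".
    private static final Pattern CONTENT = Pattern.compile(
        "^\\s*(\\d{1,9})\\s*(?:[;,]\\s*(?:url\\s*=\\s*)?['\"]?([^'\"]*?)['\"]?\\s*)?$",
        Pattern.CASE_INSENSITIVE);

    static final class Refresh {
        final int delaySeconds;
        final String target; // null when the page simply reloads itself

        Refresh(int delaySeconds, String target) {
            this.delaySeconds = delaySeconds;
            this.target = target;
        }

        // With a zero delay the intermediate page is never really shown.
        boolean isImmediate() {
            return delaySeconds == 0;
        }
    }

    // Returns null when the value does not look like a refresh directive.
    static Refresh parse(String content) {
        if (content == null) {
            return null;
        }
        Matcher m = CONTENT.matcher(content);
        if (!m.matches()) {
            return null;
        }
        String target = m.group(2);
        if (target != null && target.isEmpty()) {
            target = null;
        }
        return new Refresh(Integer.parseInt(m.group(1)), target);
    }
}
```

A fetcher could use something like isImmediate() to decide whether the 
intermediate page is worth keeping, but that policy is only an assumption 
here.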

> Fetcher discards ProtocolStatus, doesn't store redirected pages
> ---------------------------------------------------------------
>
>                 Key: NUTCH-322
>                 URL: http://issues.apache.org/jira/browse/NUTCH-322
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>    Affects Versions: 0.8-dev
>            Reporter: Andrzej Bialecki 
>             Fix For: 0.8-dev
>
>
> Fetcher doesn't store ProtocolStatus in output segments. ProtocolStatus 
> contains important information, such as protocol-level response code, 
> lastModified time, and possibly other messages.
> I propose that ProtocolStatus should be stored inside CrawlDatum.metaData, 
> which is then stored into crawl_fetch (in Fetcher.FetcherThread.output()). In 
> addition, if ProtocolStatus contains a valid lastModified time, that 
> CrawlDatum's modified time should also be set to this value.
> Additionally, Fetcher doesn't store redirected pages. Content of such pages 
> is silently discarded. When Fetcher translates from protocol-level status to 
> crawldb-level status it should probably store such pages with the following 
> translation of status codes:
> * ProtocolStatus.TEMP_MOVED -> CrawlDatum.STATUS_DB_RETRY. This code 
> indicates a transient change, so we probably shouldn't mark the initial URL 
> as bad.
> * ProtocolStatus.MOVED -> CrawlDatum.STATUS_DB_GONE. This code indicates a 
> permanent change, so the initial URL is no longer valid, i.e. it will always 
> result in redirects.
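
For illustration only, the translation proposed in the quoted description 
could be sketched roughly as below. The enums ProtocolCode and DbStatus, the 
Datum class and the "protocolStatus" metadata key are hypothetical stand-ins, 
not the actual ProtocolStatus / CrawlDatum classes, and the handling of the 
non-redirect case is an assumption.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins for the status codes named in the issue; the real
// Nutch classes are not used here.
enum ProtocolCode { SUCCESS, MOVED, TEMP_MOVED }

enum DbStatus { DB_FETCHED, DB_RETRY, DB_GONE }

final class RedirectStatusSketch {

    // Simplified stand-in for CrawlDatum: status, modified time, metadata.
    static final class Datum {
        DbStatus status;
        long modifiedTime; // 0 means "unknown"
        final Map<String, String> metaData = new HashMap<>();
    }

    // Translation proposed in the issue; the default branch is an assumption.
    static DbStatus toDbStatus(ProtocolCode code) {
        switch (code) {
            case TEMP_MOVED:
                return DbStatus.DB_RETRY; // transient change, keep the URL
            case MOVED:
                return DbStatus.DB_GONE;  // permanent change, URL is stale
            default:
                return DbStatus.DB_FETCHED;
        }
    }

    // Keeps the protocol status in the metadata instead of discarding it and
    // copies lastModified only when it carries a valid (positive) value.
    static Datum output(ProtocolCode code, long lastModified) {
        Datum datum = new Datum();
        datum.status = toDbStatus(code);
        datum.metaData.put("protocolStatus", code.name()); // hypothetical key
        if (lastModified > 0) {
            datum.modifiedTime = lastModified;
        }
        return datum;
    }
}
```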

