Re: [jira] Commented: (NUTCH-353) pages that serverside forwards will be refetched every time

Uroš Gruber Mon, 02 Oct 2006 23:29:57 -0700

Ken Krugler (JIRA) wrote:

[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]

Ken Krugler commented on NUTCH-353:

-----------------------------------


+1 that the redirect target is not always the "real" URL that we want to keep.

For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html => 
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This holds true for most  
(all?) developerWorks pages; they redirect to www-128.ibm.com/<whatever>, but IBM would 
love for the URL everybody sees to still be www.ibm.com/<whatever>.

If you check the status code of the original URL, you get 302 Found. By definition:



     10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.


In this case there is no need to replace the original URL with the redirect target.
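To make the rule concrete, here is a minimal sketch of a status-based choice. RedirectPolicy and urlToKeep are hypothetical names for illustration only, not existing Nutch classes or methods, and the sketch assumes that only a 301 should replace the stored URL.

```java
import java.net.HttpURLConnection;

/** Hypothetical helper (not part of Nutch): decides which URL to record after a redirect. */
final class RedirectPolicy {
    private RedirectPolicy() {}

    /**
     * Returns the URL to keep for future requests.
     * Assumption: only a permanent redirect (301) replaces the request URL;
     * 302 and every other status keep the original Request-URI.
     */
    static String urlToKeep(int status, String requestUrl, String redirectTarget) {
        switch (status) {
            case HttpURLConnection.HTTP_MOVED_PERM: // 301 Moved Permanently
                return redirectTarget;
            case HttpURLConnection.HTTP_MOVED_TEMP: // 302 Found
            default:
                return requestUrl;
        }
    }
}
```

With the developerWorks example, a 302 from www.ibm.com to www-128.ibm.com would leave the www.ibm.com URL as the one that is kept.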

I know that a lot of sites use permanent redirects in such cases, but I don't see any proper solution for both.



regards

Uros

pages that serverside forwards will be refetched every time
-----------------------------------------------------------

                Key: NUTCH-353
                URL: http://issues.apache.org/jira/browse/NUTCH-353
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 0.8.1, 0.9.0
           Reporter: Stefan Groschupf
        Assigned To: Andrzej Bialecki
           Priority: Blocker
            Fix For: 0.9.0

        Attachments: doNotRefecthForwarderPagesV1.patch

Pages that do a server-side forward are not written back into the crawlDb with a status change, and the nextFetchTime is not changed either. This causes the same page to be refetched again and again. As a result, Nutch is not polite: it refetches both the forwarding page and the target page in each segment iteration. It also affects scoring, since the forwarding page contributes its score to all outlinks.
