Ken Krugler (JIRA) wrote:
>     [ 
> http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 
> ] 
>             
> Ken Krugler commented on NUTCH-353:
> -----------------------------------
>
> +1 that the redirect target is not always the "real" URL that we want to keep.
>
> For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html 
> => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This 
> holds true for most  (all?) developerWorks pages; they redirect to 
> www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees to 
> still be www.ibm.com/<whatever>.
>
>   
If you check the status code of the original URL, you get 302 Found. By 
definition (RFC 2616):


      10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since 
the redirection might be altered on occasion, the client SHOULD continue 
to use the Request-URI for future requests. This response is only 
cacheable if indicated by a Cache-Control or Expires header field.

In this case there is no need to replace the original URL with the 
redirect target.
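
To make the distinction concrete, here is a minimal sketch of the rule 
the RFC describes. It is not Nutch code: the class RedirectPolicy and 
the method urlToKeep are hypothetical names used only for illustration, 
and the sketch looks at the status code alone.

```java
// Hypothetical helper, not part of Nutch: picks which URL to record
// after a redirect, following the RFC 2616 wording quoted above.
final class RedirectPolicy {

    private RedirectPolicy() {}

    /**
     * @param status      HTTP status code returned for the original request
     * @param originalUrl the URL that was requested (the Request-URI)
     * @param targetUrl   the URL given in the Location header
     * @return the URL to keep for future requests
     */
    static String urlToKeep(int status, String originalUrl, String targetUrl) {
        switch (status) {
            case 301: // Moved Permanently: the new URI should be used from now on
                return targetUrl;
            case 302: // Found: temporary, keep using the Request-URI
            case 303: // See Other: the new URI does not replace the original
            case 307: // Temporary Redirect: temporary, keep using the Request-URI
            default:  // anything else: do not rewrite the stored URL
                return originalUrl;
        }
    }

    public static void main(String[] args) {
        String original = "http://www.ibm.com/developerworks/lotus/downloads/toolkits.html";
        String target = "http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html";
        // The developerWorks page answers with 302, so the original URL is kept.
        System.out.println(urlToKeep(302, original, target));
    }
}
```

This only covers servers that use the status codes as the RFC intends; 
it does nothing for sites that send a permanent redirect where the 
original URL should still be the one shown.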

I know that a lot of sites use permanent redirects in such cases, but I 
don't see any proper solution that handles both.


regards

Uros
>> pages that serverside forwards will be refetched every time
>> -----------------------------------------------------------
>>
>>                 Key: NUTCH-353
>>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>>             Project: Nutch
>>          Issue Type: Bug
>>    Affects Versions: 0.8.1, 0.9.0
>>            Reporter: Stefan Groschupf
>>         Assigned To: Andrzej Bialecki 
>>            Priority: Blocker
>>             Fix For: 0.9.0
>>
>>         Attachments: doNotRefecthForwarderPagesV1.patch
>>
>>
>> Pages that do a server-side forward are not written back into the crawlDb 
>> with a status change. The nextFetchTime is not changed either. 
>> This causes the same page to be refetched again and again. As a result, 
>> Nutch is not polite: it refetches the forwarding page and the target page 
>> in each segment iteration. It also affects the scoring, since the 
>> forwarding page contributes its score to all outlinks.
>>     
>
>   


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
