Ken Krugler (JIRA) wrote:
> [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]
>
> Ken Krugler commented on NUTCH-353:
> -----------------------------------
>
> +1 that the redirect target is not always the "real" URL that we want to keep.
>
> For example, http://www.ibm.com/developerworks/lotus/downloads/toolkits.html
> => http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This
> holds true for most (all?) developerWorks pages; they redirect to
> www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees to
> still be www.ibm.com/<whatever>.
>
>
If you check the status code of the original URL, you get 302 Found. By
definition (RFC 2616):

  10.3.3 302 Found
  The requested resource resides temporarily under a different URI. Since
  the redirection might be altered on occasion, the client SHOULD continue
  to use the Request-URI for future requests. This response is only
  cacheable if indicated by a Cache-Control or Expires header field.

In this case there is no need to replace the original URL with the
redirect target. I know that a lot of sites use permanent redirects in
such cases, but I don't see any proper solution that handles both.
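
To make the status-code rule concrete, here is a minimal sketch in Java. The
class RedirectPolicy and the method urlToKeep are hypothetical names used only
for illustration; they are not part of Nutch and not taken from the attached
patch. The sketch only encodes what RFC 2616 says about 301, 302 and 307; it
does not address sites that answer with a permanent redirect but still want
the original URL shown.

```java
// Hypothetical helper, not part of Nutch: decides which URL a crawler
// should record for future requests after following a redirect.
final class RedirectPolicy {
    private RedirectPolicy() {}

    /**
     * @param status     HTTP status code returned for requestUrl
     * @param requestUrl the URL that was originally requested
     * @param targetUrl  the URL named in the Location header
     * @return the URL to keep for future requests
     */
    static String urlToKeep(int status, String requestUrl, String targetUrl) {
        switch (status) {
            case 301: // Moved Permanently: future references should use the new URI
                return targetUrl;
            case 302: // Found: temporary, keep using the Request-URI
            case 307: // Temporary Redirect: also keep using the Request-URI
                return requestUrl;
            default:  // Assumption of this sketch: anything else keeps the original
                return requestUrl;
        }
    }
}
```

For the developerWorks example, calling urlToKeep with status 302, the
www.ibm.com URL as requestUrl and the www-128.ibm.com URL as targetUrl
returns the www.ibm.com URL.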
regards
Uros
>> pages that serverside forwards will be refetched every time
>> -----------------------------------------------------------
>>
>> Key: NUTCH-353
>> URL: http://issues.apache.org/jira/browse/NUTCH-353
>> Project: Nutch
>> Issue Type: Bug
>> Affects Versions: 0.8.1, 0.9.0
>> Reporter: Stefan Groschupf
>> Assigned To: Andrzej Bialecki
>> Priority: Blocker
>> Fix For: 0.9.0
>>
>> Attachments: doNotRefecthForwarderPagesV1.patch
>>
>>
>> Pages that do a server-side forward are not written back into the crawlDb
>> with a status change, and their nextFetchTime is not updated.
>> As a result the same page is refetched again and again: Nutch is not
>> polite, refetching both the forwarding page and the target page in each
>> segment iteration. It also affects scoring, since the forwarding page
>> contributes its score to all outlinks.
>>
>
>