In this case, the site uses the "right" kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of the
page). And then there's the problem of what to do with meta refresh tags,
which don't have a "permanent" vs. "temporary" indication.
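
To make the distinction concrete, here is a minimal sketch (not Nutch code) of classifying a redirect by its HTTP/1.1 status code. The names redirect_kind and url_to_keep are hypothetical. Treating a meta refresh as "unknown", and keeping the original URL for anything that is not clearly permanent, are assumptions made for illustration only.

```python
from typing import Optional

# HTTP/1.1 redirect status codes, grouped by stated intent.
PERMANENT = {301}
TEMPORARY = {302, 303, 307}


def redirect_kind(status: Optional[int], meta_refresh: bool = False) -> str:
    """Return 'permanent', 'temporary', 'unknown', or 'none'."""
    if meta_refresh:
        # A meta refresh tag carries no permanent/temporary indication.
        return "unknown"
    if status in PERMANENT:
        return "permanent"
    if status in TEMPORARY:
        return "temporary"
    return "none"


def url_to_keep(source: str, target: str, status: Optional[int],
                meta_refresh: bool = False) -> str:
    """Assumed policy: adopt the target only for a permanent redirect."""
    if redirect_kind(status, meta_refresh) == "permanent":
        return target
    return source
```

Since many sites send 302 even for permanent moves, a policy like this would often keep the non-canonical URL, which is exactly the reliability problem described above.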

An alternative is to use the link structure: the page with the most
external links is likely the canonical version of the page (although with
permanent redirects, there is a time lag as sites linking to the page stop
using the old name and start using the new one). This won't work well in
small crawls, though, given the relative paucity of links.
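
The link-structure idea could be sketched roughly as follows. This is an illustration under stated assumptions, not an existing Nutch feature: "external links" is read as inlinks whose source host differs from the page's own host, and pick_canonical, external_inlink_count, and the inlinks_by_url mapping are hypothetical names.

```python
from typing import Dict, List
from urllib.parse import urlsplit


def external_inlink_count(url: str, inlinks: List[str]) -> int:
    """Count inlinks that come from a host other than the page's own."""
    host = urlsplit(url).hostname
    return sum(1 for src in inlinks if urlsplit(src).hostname != host)


def pick_canonical(candidates: List[str],
                   inlinks_by_url: Dict[str, List[str]]) -> str:
    """Choose the candidate with the most external inlinks.

    On a tie, max() returns the first candidate in the list.
    """
    return max(
        candidates,
        key=lambda u: external_inlink_count(u, inlinks_by_url.get(u, [])),
    )
```

In a small crawl most candidates would have zero or one external inlink, so the choice would frequently fall through to the tie-break rather than reflect real link evidence.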

In any case, if we have an inexpensive way of aliasing the two to be the
same, we won't lose any anchor text, and we're effectively not "throwing
out" either URL, so it matters less which one we choose.
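
One inexpensive way to alias two URLs is a small union-find table in which both names resolve to a shared record. The sketch below is only an illustration of that idea; the AliasTable class and its methods are hypothetical and do not correspond to anything in the crawlDb.

```python
from typing import Dict, List


class AliasTable:
    """Union-find over URLs; anchor text is stored once per alias group."""

    def __init__(self) -> None:
        self._parent: Dict[str, str] = {}
        self._anchors: Dict[str, List[str]] = {}

    def find(self, url: str) -> str:
        """Return the representative URL of the group containing url."""
        self._parent.setdefault(url, url)
        while self._parent[url] != url:
            url = self._parent[url]
        return url

    def alias(self, a: str, b: str) -> None:
        """Merge the groups of a and b, combining their anchor text."""
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self._parent[rb] = ra
            merged = self._anchors.pop(rb, [])
            self._anchors.setdefault(ra, []).extend(merged)

    def add_anchor(self, url: str, text: str) -> None:
        self._anchors.setdefault(self.find(url), []).append(text)

    def anchors(self, url: str) -> List[str]:
        return list(self._anchors.get(self.find(url), []))
```

After alias(a, b), anchor text recorded under either name is returned for both, so the choice of which URL to display can be made separately.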
-Doug
Uros Gruber-2 wrote:
>
> Ken Krugler (JIRA) wrote:
>> [
>> http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304
>> ]
>>
>> Ken Krugler commented on NUTCH-353:
>> -----------------------------------
>>
>> +1 that the redirect target is not always the "real" URL that we want to
>> keep.
>>
>> For example,
>> http://www.ibm.com/developerworks/lotus/downloads/toolkits.html =>
>> http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This
>> holds true for most (all?) developerWorks pages; they redirect to
>> www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees
>> to still be www.ibm.com/<whatever>.
>>
>>
> If you check the status code of the original URL, you get 302 Found. By
> definition:
>
>
> 10.3.3 302 Found
>
> The requested resource resides temporarily under a different URI. Since
> the redirection might be altered on occasion, the client SHOULD continue
> to use the Request-URI for future requests. This response is only
> cacheable if indicated by a Cache-Control or Expires header field.
>
> In this case there is no need to replace the original URL with the
> redirect target.
>
> I know that a lot of sites use permanent redirects in such cases. But I
> don't see any proper solution for both.
>
>
> regards
>
> Uros
>>> pages that serverside forwards will be refetched every time
>>> -----------------------------------------------------------
>>>
>>> Key: NUTCH-353
>>> URL: http://issues.apache.org/jira/browse/NUTCH-353
>>> Project: Nutch
>>> Issue Type: Bug
>>> Affects Versions: 0.8.1, 0.9.0
>>> Reporter: Stefan Groschupf
>>> Assigned To: Andrzej Bialecki
>>> Priority: Blocker
>>> Fix For: 0.9.0
>>>
>>> Attachments: doNotRefecthForwarderPagesV1.patch
>>>
>>>
>>> Pages that do a server-side forward are not written back into the
>>> crawlDb with a status change. Also, the nextFetchTime is not changed.
>>> This causes the same page to be refetched again and again. As a result,
>>> Nutch is not polite: it refetches the forwarding page and the target
>>> page in each segment iteration. It also affects scoring, since the
>>> forwarding page contributes its score to all outlinks.
>>>
>>
>>
>
>
>