Doug Cook wrote:
In this case, the site uses the "right" kind of redirect. Unfortunately, as
you point out, it's not at all clear that we can rely on sites correctly
choosing the type of redirect (I tried a few sites and most were 302s, even
in cases where the redirect was to the permanent, canonical version of the
page). And then there's the problem of what to do with meta refresh tags,
which don't have a "permanent" vs. "temporary" indication.
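
As an illustration of the distinction discussed above, here is a minimal sketch of how a crawler might bucket redirects. The names `RedirectKind` and `classify_redirect` are hypothetical and not part of Nutch. The status codes are standard HTTP, but grouping them this way is only one possible choice, and a meta refresh is reported as unknown because it carries no permanent/temporary signal.

```python
from enum import Enum


class RedirectKind(Enum):
    PERMANENT = "permanent"
    TEMPORARY = "temporary"
    UNKNOWN = "unknown"


# Standard HTTP redirect status codes; the grouping is this sketch's assumption.
PERMANENT_CODES = {301, 308}
TEMPORARY_CODES = {302, 303, 307}


def classify_redirect(status_code=None, is_meta_refresh=False):
    """Return the redirect kind for an HTTP status code or a meta refresh."""
    if is_meta_refresh:
        # A meta refresh tag says nothing about permanence.
        return RedirectKind.UNKNOWN
    if status_code in PERMANENT_CODES:
        return RedirectKind.PERMANENT
    if status_code in TEMPORARY_CODES:
        return RedirectKind.TEMPORARY
    return RedirectKind.UNKNOWN
```

Since many sites send 302 even for permanent moves, the result is better treated as a hint than as ground truth.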

An alternative is to use the link structure - the page with the most
external links is likely the canonical version of the page. (Although with
permanent redirects, there is a time lag as sites linking to the page stop
using the old name and start using the new name). This won't work well in
small crawls, though, given the relative paucity of links.
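
The link-structure idea above could be sketched roughly as follows. This is an assumption-laden illustration, not Nutch code: `pick_canonical` and the `inlinks` mapping (URL to list of linking URLs) are hypothetical, and "external" is taken to mean "linking page is on a different host".

```python
from urllib.parse import urlsplit


def pick_canonical(candidates, inlinks):
    """Pick the candidate URL with the most inlinks from other hosts.

    candidates: list of URLs believed to be the same page.
    inlinks: dict mapping a URL to the URLs of pages that link to it.
    """
    def external_count(url):
        host = urlsplit(url).hostname
        # Count only links whose source page lives on a different host.
        return sum(
            1 for src in inlinks.get(url, ())
            if urlsplit(src).hostname != host
        )

    # max() keeps the earliest candidate on ties.
    return max(candidates, key=external_count)
```

In a small crawl most counts will be zero or one, so the result mostly reflects the order of `candidates`, which is the weakness noted above.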

This could work, because other sites most certainly don't link to the redirecting URL. But as you point out, permanent redirects are a problem; we have the same situation on our portal. We have a new structure and some links have changed, so we added permanent redirects from the old URLs to the new ones. In this case the only solution is to replace the old URL with the target of the permanent redirect.

In any case, if we have an inexpensive way of aliasing the two to be the
same, we won't lose any anchor text, and we're effectively not "throwing
out" either URL, so it matters less which one we choose.
Do you have an example of what this aliasdb would look like?
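
One possible shape for such an alias store, offered only as a sketch: the class `AliasDb` and its methods are hypothetical and do not describe any existing Nutch component. It keeps every URL in a group with one representative (a union-find structure) and merges anchor text when two URLs are aliased, so neither URL is thrown out.

```python
class AliasDb:
    """Hypothetical alias store: maps each URL to a group representative."""

    def __init__(self):
        self._parent = {}   # URL -> parent URL in its alias group
        self._anchors = {}  # representative URL -> list of anchor texts

    def find(self, url):
        """Return the representative URL of the group containing url."""
        self._parent.setdefault(url, url)
        root = url
        while self._parent[root] != root:
            root = self._parent[root]
        # Path compression: point every visited URL straight at the root.
        while self._parent[url] != root:
            self._parent[url], url = root, self._parent[url]
        return root

    def alias(self, source, target):
        """Record that source and target are the same page."""
        src_root, dst_root = self.find(source), self.find(target)
        if src_root != dst_root:
            self._parent[src_root] = dst_root
            # Carry over anchor text collected under the old representative.
            moved = self._anchors.pop(src_root, [])
            self._anchors.setdefault(dst_root, []).extend(moved)

    def add_anchor(self, url, text):
        """Attach anchor text to whatever group url belongs to."""
        self._anchors.setdefault(self.find(url), []).append(text)

    def anchors(self, url):
        """Return all anchor text known for url or any of its aliases."""
        return list(self._anchors.get(self.find(url), []))
```

Which URL ends up as the representative is arbitrary here (the redirect target), which matches the point that the choice matters less once both names resolve to the same record.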

regards
      -Doug


Uros Gruber-2 wrote:
Ken Krugler (JIRA) wrote:
    [ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439304 ]
Ken Krugler commented on NUTCH-353:
-----------------------------------

+1 that the redirect target is not always the "real" URL that we want to
keep.

For example,
http://www.ibm.com/developerworks/lotus/downloads/toolkits.html =>
http://www-128.ibm.com/developerworks/lotus/downloads/toolkits.html. This
holds true for most  (all?) developerWorks pages; they redirect to
www-128.ibm.com/<whatever>, but IBM would love for the URL everybody sees
to still be www.ibm.com/<whatever>.

If you check the status code of the original URL, you get 302 Found. By definition (RFC 2616):


      10.3.3 302 Found

The requested resource resides temporarily under a different URI. Since the redirection might be altered on occasion, the client SHOULD continue to use the Request-URI for future requests. This response is only cacheable if indicated by a Cache-Control or Expires header field.

In this case there is no need to replace the original URL with the redirect target.

I know that a lot of sites use permanent redirects in such cases. But I don't see any proper solution for both.


regards

Uros
pages that serverside forwards will be refetched every time
-----------------------------------------------------------

                Key: NUTCH-353
                URL: http://issues.apache.org/jira/browse/NUTCH-353
            Project: Nutch
         Issue Type: Bug
   Affects Versions: 0.8.1, 0.9.0
           Reporter: Stefan Groschupf
        Assigned To: Andrzej Bialecki
           Priority: Blocker
            Fix For: 0.9.0

        Attachments: doNotRefecthForwarderPagesV1.patch


Pages that do a server-side forward are not written back into the
crawlDb with a status change. Also, the nextFetchTime is not changed.
This causes the same page to be refetched again and again. The result is
that Nutch is not polite and refetches both the forwarding page and the
target page in each segment iteration. It also affects the scoring, since
the forwarding page contributes its score to all outlinks.
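
To make the reported behaviour concrete, here is a small sketch of the bookkeeping the description says is missing. It is not the attached patch and does not use Nutch's classes: `CrawlEntry`, `record_fetch`, `due_for_fetch`, and the 30-day interval are all assumptions made for illustration. The point is only that a redirecting page also needs its status and next fetch time written back, otherwise it is selected again in every round.

```python
from dataclasses import dataclass

FETCH_INTERVAL_SECONDS = 30 * 24 * 3600  # assumed refetch interval


@dataclass
class CrawlEntry:
    url: str
    status: str = "unfetched"
    next_fetch_time: float = 0.0


def record_fetch(entry, http_status, now, interval=FETCH_INTERVAL_SECONDS):
    """Write the outcome of a fetch back to the entry, redirects included."""
    if http_status in (301, 302, 303, 307):
        entry.status = "redirected"
    elif 200 <= http_status < 300:
        entry.status = "fetched"
    else:
        entry.status = "failed"
    # Move the next fetch time forward for every outcome, so the URL
    # is not picked up again in the very next round.
    entry.next_fetch_time = now + interval
    return entry


def due_for_fetch(entries, now):
    """Return the URLs whose next fetch time has been reached."""
    return [e.url for e in entries if e.next_fetch_time <= now]
```

A real crawler would likely use a shorter retry interval for failures; the single interval here just keeps the sketch short.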


