[ https://issues.apache.org/jira/browse/NUTCH-572?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12540857 ]
Dennis Kubes commented on NUTCH-572:
------------------------------------
The most recent patch for NUTCH-547 handles the most common version of this
error, in which temporarily redirected URLs are not stored or indexed. We may
still want to discuss how to handle scoring for permanent redirects.
> Scoring and redirected Urls
> ---------------------------
>
> Key: NUTCH-572
> URL: https://issues.apache.org/jira/browse/NUTCH-572
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 0.8, 0.8.1, 0.9.0
> Environment: All
> Reporter: Dennis Kubes
> Assignee: Dennis Kubes
> Fix For: 1.0.0
>
>
> When a redirect is found for a given URL, the new (end) URL is stored as the
> content page and the old CrawlDatum gets one of a few redirect status codes.
> The page that gets indexed in Nutch is the end page, and it is indexed under
> the end URL. Many times a site will have a significant number of links pointing
> to the start page and very few pointing to the redirected end page. This is
> especially true for external links. OPIC scores do not get transferred to the
> end page but stay with the start page (the one doing the redirecting), and
> the start page doesn't get indexed. Hence the end page shows up in the
> index, but usually with a much reduced score. A good example of this is
> cnn.com:
> URL: http://www.cnn.com/
> Version: 6
> Status: 5 (db_redir_perm)
> Fetch time: Tue Dec 04 11:02:09 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 51.19438
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> which redirects to http://www.cnn.com/?refresh=1
> URL: http://www.cnn.com/?refresh=1
> Version: 6
> Status: 2 (db_fetched)
> Fetch time: Tue Dec 04 11:02:11 CST 2007
> Modified time: Wed Dec 31 18:00:00 CST 1969
> Retries since fetch: 0
> Retry interval: 2592000 seconds (30 days)
> Score: 1.0
> Signature: b5baaf80e9e10aa6205fc39051c362ff
> Metadata: _pst_:success(1), lastModified=0
> Now cnn.com, which should be one of the highest-ranking sites, if not the
> highest, in the index for keywords such as "news", in fact doesn't show up in
> the index, and its redirected end page appears much farther down in search
> results. My proposal is that we somehow make OPIC scores follow redirects. To
> do this we would most likely need to store a start and end URL for redirected
> URLs.
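
One way to picture the proposal above is the rough Python sketch below. It is not Nutch code: `Entry` is a hypothetical, simplified stand-in for a CrawlDb record, `redirect_to` is the assumed extra field holding the stored end URL, and adding the start page's score to the end page is only one possible policy (taking the maximum would be another).

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Entry:
    """Hypothetical, simplified CrawlDb record (not Nutch's CrawlDatum)."""
    status: str                        # e.g. "db_redir_perm" or "db_fetched"
    score: float
    redirect_to: Optional[str] = None  # assumed field: end URL of the redirect


def follow_redirect_scores(db: Dict[str, Entry], max_hops: int = 5) -> Dict[str, float]:
    """Return an effective score per non-redirect URL.

    Each permanent redirect's score is added to the page its chain ends on.
    Chains that loop, exceed max_hops, or leave the db are ignored.
    """
    # Start from the stored scores of every page that is not a permanent redirect.
    effective = {url: e.score for url, e in db.items() if e.status != "db_redir_perm"}

    for url, entry in db.items():
        if entry.status != "db_redir_perm":
            continue
        target = entry.redirect_to
        seen = {url}
        hops = 0
        # Walk the chain while the target is itself a permanent redirect.
        while (target in db and db[target].status == "db_redir_perm"
               and target not in seen and hops < max_hops):
            seen.add(target)
            target = db[target].redirect_to
            hops += 1
        # Credit the score only if the chain ended on an indexable page.
        if target in effective:
            effective[target] += entry.score
    return effective


# Usage with the two cnn.com records quoted in the issue description.
db = {
    "http://www.cnn.com/": Entry("db_redir_perm", 51.19438, "http://www.cnn.com/?refresh=1"),
    "http://www.cnn.com/?refresh=1": Entry("db_fetched", 1.0),
}
scores = follow_redirect_scores(db)
```

With this policy the end page `http://www.cnn.com/?refresh=1` would carry the combined score of both records instead of 1.0, and the start page would no longer hold a score that never reaches the index.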
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.