Sebastian Nagel created NUTCH-2749:
--------------------------------------
Summary: Fetcher and scoring-opic: transfer score to redirects
Key: NUTCH-2749
URL: https://issues.apache.org/jira/browse/NUTCH-2749
Project: Nutch
Issue Type: Improvement
Components: fetcher, plugin, scoring
Affects Versions: 1.16
Reporter: Sebastian Nagel
Fix For: 1.17
See the discussion "[Score value lost after two successive
redirects|https://lists.apache.org/thread.html/dbf7737fb8e6566d252e76290db806fac19dc56b854749c78a995bb8@1385999850@%3Cuser.nutch.apache.org%3E]"
dating back to 2012.
Redirects should be able to pass scores on to their targets. This is mandatory
for reliable scoring; otherwise scores often get lost when a link target is
redirected. E.g., when the target site has moved from http:// to https://,
incoming links to http:// pages are usually redirected to https:// (on the
target site), and the incoming score is lost. If the migration to https://
happened recently, the scores for this site might just become zero.
I agree with [~markus17]'s comment in the mentioned discussion on @user that "it
cannot be a good idea to just copy over the score". Instead, redirects should
have the same effect as a page containing a single href link.
This would require the following change(s):
1. in Fetcher (class FetcherThread): the score should be passed forward to the
redirect target
* because the method {{distributeScoreToOutlinks(...)}} cannot be called for
redirects (no content is parsed), we would need a dedicated hook
{{distributeScoreToRedirect(Text fromUrl, Text toUrl, CrawlDatum source,
CrawlDatum target)}}
* to be called both for "recorded" and followed redirects (depending on
http.redirect.max)
* scoring strategies can be implemented there, e.g. apply
"db.score.link.\{internal,external}"
* to be implemented as a [default
method|https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html]
so that existing scoring filter plugins are not broken
2. during CrawlDb update (class CrawlDbReducer), there are different cases to
consider:
a. URL not yet in CrawlDb: nothing to do if the score has been already passed
forward (step 1)
b. URL already in CrawlDb, redirects not followed in fetcher (http.redirect.max
== 0): the redirect target has been stored as db_outlink, so it will be used in
the scoring method updateDbScore(...) -> nothing to do
c. URL already in CrawlDb, fetcher follows redirects: to get the same behavior
as for incoming links we would need to mark fetches stemming from a followed
redirect and use them in a modified updateDbScore(...)
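
The hook proposed in point 1 could be sketched roughly as follows. This is only an illustration, not the actual Nutch API: {{Datum}} and {{ScoringFilterSketch}} are simplified stand-ins for {{CrawlDatum}} and the scoring filter interface, plain strings replace {{Text}}, and the two factors are assumed to come from db.score.link.internal and db.score.link.external. It shows how a default method leaves an existing plugin untouched, while an OPIC-like plugin treats the redirect as a page with a single outlink.

```java
import java.net.URI;

public class RedirectScoreSketch {

  /** Simplified stand-in for CrawlDatum (assumption): only the score is modelled. */
  static final class Datum {
    private float score;
    Datum(float score) { this.score = score; }
    float getScore() { return score; }
    void setScore(float score) { this.score = score; }
  }

  /** Hypothetical stand-in for the scoring filter interface. */
  interface ScoringFilterSketch {
    /** Proposed hook. The default does nothing, so older plugins still compile. */
    default void distributeScoreToRedirect(String fromUrl, String toUrl,
        Datum source, Datum target) {
      // intentionally empty
    }
  }

  /** A plugin written before the hook existed: it does not override it. */
  static final class LegacyFilter implements ScoringFilterSketch { }

  /** OPIC-like plugin: a redirect counts as a page with exactly one outlink. */
  static final class LinkLikeFilter implements ScoringFilterSketch {
    private final float internalFactor; // assumed: db.score.link.internal
    private final float externalFactor; // assumed: db.score.link.external

    LinkLikeFilter(float internalFactor, float externalFactor) {
      this.internalFactor = internalFactor;
      this.externalFactor = externalFactor;
    }

    @Override
    public void distributeScoreToRedirect(String fromUrl, String toUrl,
        Datum source, Datum target) {
      boolean sameHost = hostOf(fromUrl).equalsIgnoreCase(hostOf(toUrl));
      float factor = sameHost ? internalFactor : externalFactor;
      // Only one "outlink", so the source score is not split any further.
      target.setScore(target.getScore() + source.getScore() * factor);
    }

    private static String hostOf(String url) {
      String host = URI.create(url).getHost();
      return host == null ? "" : host;
    }
  }

  public static void main(String[] args) {
    Datum source = new Datum(2.0f);
    Datum target = new Datum(0.0f);
    // http:// to https:// on the same host: the internal factor applies.
    new LinkLikeFilter(1.0f, 0.5f).distributeScoreToRedirect(
        "http://example.org/a", "https://example.org/a", source, target);
    System.out.println("score passed to redirect target: " + target.getScore());
  }
}
```

Whether the passed score should be added to the target's score or replace it is a design choice of the individual plugin; the sketch adds it, in line with how an incoming link would contribute.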
Being pragmatic, I would address in this issue only point 1 (and, implicitly, 2a
and 2b). Point 2c would require significant changes and isn't easy to control
in the worst case, i.e. if multiple followed redirects all end in the
same target.