[jira] [Created] (NUTCH-2749) Fetcher and scoring-opic: transfer score to redirects

Sebastian Nagel (Jira) Fri, 18 Oct 2019 09:21:33 -0700

Sebastian Nagel created NUTCH-2749:
--------------------------------------

             Summary: Fetcher and scoring-opic: transfer score to redirects
                 Key: NUTCH-2749
                 URL: https://issues.apache.org/jira/browse/NUTCH-2749
             Project: Nutch
          Issue Type: Improvement
          Components: fetcher, plugin, scoring
    Affects Versions: 1.16
            Reporter: Sebastian Nagel
             Fix For: 1.17



See the discussion "[Score value lost after two successive 
redirects|https://lists.apache.org/thread.html/dbf7737fb8e6566d252e76290db806fac19dc56b854749c78a995bb8@1385999850@%3Cuser.nutch.apache.org%3E]",
dating back to 2012.

Redirects should be able to pass scores to their targets. This is required 
for reliable scoring; otherwise scores are often lost when a link target is 
redirected. E.g., when the target site has moved from http:// to https://, 
incoming links to http:// pages are usually redirected to https:// (on the 
target site), and the incoming score is lost. If the migration to https:// 
happened recently, the scores for this site might simply drop to zero.

I agree with [~markus17]'s comment in the mentioned discussion @user that "it 
cannot be a good idea to just copy over the score". Instead, a redirect should 
have the same effect as a page containing a single href link.

This would require the following change(s):

1. in Fetcher (class FetcherThread): the score should be passed forward to the 
redirect target
 * because the method {{distributeScoreToOutlinks(...)}} cannot be called for 
redirects (no content is parsed), we would need a dedicated hook
 {{distributeScoreToRedirect(Text fromUrl, Text toUrl, CrawlDatum source, 
CrawlDatum target)}}
 * to be called both for "recorded" and followed redirects (depending on 
http.redirect.max)
 * scoring strategies can be implemented there, e.g. apply 
"db.score.link.\{internal,external}"
 * to be implemented as a [default 
method|https://docs.oracle.com/javase/tutorial/java/IandI/defaultmethods.html] 
so that existing scoring filter plugins are not broken
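
A minimal sketch of how the hook from step 1 could look. It is not the actual 
Nutch code: {{Datum}} is a simplified stand-in for CrawlDatum, URLs are plain 
strings instead of Text, and the names {{RedirectAwareScoringFilter}} and 
{{SingleLinkRedirectScoring}} are hypothetical. The two factors stand in for 
the values of db.score.link.internal and db.score.link.external. Overwriting 
the target score (rather than adding to it) is also an assumption.

```java
import java.net.URI;

/** Simplified stand-in for Nutch's CrawlDatum; only the score is modeled. */
class Datum {
    private float score;

    Datum(float score) { this.score = score; }

    float getScore() { return score; }

    void setScore(float score) { this.score = score; }
}

/** Hypothetical scoring-filter interface carrying the proposed redirect hook. */
interface RedirectAwareScoringFilter {
    /** Default is a no-op, so existing implementations keep compiling unchanged. */
    default void distributeScoreToRedirect(String fromUrl, String toUrl,
                                           Datum source, Datum target) {
        // intentionally empty
    }
}

/** Treats a redirect like a page with exactly one outlink (assumed strategy). */
class SingleLinkRedirectScoring implements RedirectAwareScoringFilter {
    private final float internalFactor; // assumed: value of db.score.link.internal
    private final float externalFactor; // assumed: value of db.score.link.external

    SingleLinkRedirectScoring(float internalFactor, float externalFactor) {
        this.internalFactor = internalFactor;
        this.externalFactor = externalFactor;
    }

    @Override
    public void distributeScoreToRedirect(String fromUrl, String toUrl,
                                          Datum source, Datum target) {
        float factor = sameHost(fromUrl, toUrl) ? internalFactor : externalFactor;
        // One outlink only, so the whole source score (times the factor) is passed on.
        target.setScore(source.getScore() * factor);
    }

    /** True if both URLs have the same non-null host (case-insensitive). */
    static boolean sameHost(String a, String b) {
        String hostA = URI.create(a).getHost();
        String hostB = URI.create(b).getHost();
        return hostA != null && hostA.equalsIgnoreCase(hostB);
    }
}
```

With this shape, an http:// to https:// redirect on the same host counts as an 
internal link, while a filter that does not override the hook leaves the 
target score untouched.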

2. during CrawlDb update (class CrawlDbReducer), there are different cases to 
consider:

a. URL not yet in CrawlDb: nothing to do if the score has been already passed 
forward (step 1)

b. URL already in CrawlDb, redirects not followed in fetcher (http.redirect.max 
== 0): the redirect target has been stored as db_outlink, so it will be used in 
the scoring method updateDbScore(...) -> nothing to do

c. URL already in CrawlDb, fetcher follows redirects: to get the same behavior 
as for incoming links we would need to mark fetches stemming from a followed 
redirect and use them in a modified updateDbScore(...)

Being pragmatic, I would address in this issue only point 1 (and, implicitly, 
2a and 2b). Point 2c would require significant changes and isn't easy to 
control in the worst case, when multiple followed redirects all end in the 
same target.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
