Redirected URLs and possibly all of their outlinked URLs have invalid scores.
-----------------------------------------------------------------------------

                 Key: NUTCH-1044
                 URL: https://issues.apache.org/jira/browse/NUTCH-1044
             Project: Nutch
          Issue Type: Bug
          Components: fetcher, parser
    Affects Versions: 1.3
            Reporter: Nutch User - 1


1.: 
http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
2.: 
http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html

Please note that URLs redirected by a meta refresh also have invalid scores. 
For such URLs a CrawlDatum is created on lines 157-177 of 
ParseOutputFormat.java 
(http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup).
The new CrawlDatum's score isn't set anywhere after creation, so it stays at 
1.0f, as can be seen on line 122 of CrawlDatum.java 
(http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).

A separate question is whether the redirected URL's score should simply be 
passed on to the new URL, or whether the redirection should be treated as a 
link, in which case the new URL's score would be 
'originalScore' / ('numberOfOutlinks' + 1).
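
To make the two options concrete, here is a minimal standalone sketch of the 
arithmetic. It does not use any Nutch classes; the class name 
RedirectScoreSketch and the methods passThrough and asLink are hypothetical 
names chosen for illustration only, and the sample values are made up.

```java
// Hypothetical sketch: none of these names exist in Nutch.
public class RedirectScoreSketch {

    // Option 1: the redirect target simply inherits the original URL's score.
    public static float passThrough(float originalScore) {
        return originalScore;
    }

    // Option 2: the redirect is counted as one additional link, so the score
    // is divided among the page's outlinks plus the redirect itself.
    public static float asLink(float originalScore, int numberOfOutlinks) {
        if (numberOfOutlinks < 0) {
            throw new IllegalArgumentException("numberOfOutlinks must be >= 0");
        }
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float originalScore = 2.0f;   // illustrative value
        int numberOfOutlinks = 3;     // illustrative value

        System.out.println(passThrough(originalScore));              // prints 2.0
        System.out.println(asLink(originalScore, numberOfOutlinks)); // prints 0.5
    }
}
```

With no outlinks the two options give the same result, since the score is 
divided by 1; they diverge as soon as the redirecting page has outlinks of 
its own.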
