[Nutch Wiki] Trivial Update of "NutchScoring" by LewisJohnMcgibbney

Apache Wiki Sun, 21 Sep 2014 11:01:04 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchScoring" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchScoring?action=diff&rev1=8&rev2=9

Scoring occurs in numerous places throughout the Nutch codebase and
consequently within the crawl cycle. This section describes where scoring
occurs and what purpose it serves at each step.

*
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java|./src/java/org/apache/nutch/crawl/Injector.java]]
- Scoring filters are defined within the various MapReduce job configurations.
This means that the desired configuration will be used appropriately at runtime
when the job is run by the JobClient. The Injector actually contains two
MapReduce jobs, namely
- * sortJob - where we set the InjectMapper as the Mapreduce Mapper
override. The InjectMapper uses ScoringFilters to calculate a new initial score
for a particular URL based on passing in the Hadoop Text key (representing the
URL of the page) and associated CrawlDatum value (representing a new datum.
Filters will modify it in-place) to the ScoringFilters.injectedScore method.
Essentially this sets an initial score for newly injected pages. It should be
noted that newly injected pages may have no inlinks, so filter implementations
may wish to set this score to a non-zero value, to give newly injected pages
some initial credit.
- * mergeJob -
+ * sortJob - where we set the InjectMapper as the MapReduce Mapper
override. The InjectMapper passes the Hadoop Text key (the URL of the page)
and the associated CrawlDatum value (a new datum, which filters modify
in place) to the ScoringFilters.injectedScore method, which calculates the
initial score for a newly injected page. Newly injected pages may have no
inlinks, so filter implementations may wish to set this score to a non-zero
value to give such pages some initial credit. The relevant property here is
{{{db.score.injected}}}, which assigns a default score of 1.0f to new pages
added by the injector. This default can be overridden for any URL in a seed
list by attaching the {{{nutch.score}}} metadata key to it, which sets a custom
score for that specific URL. If the key is present we assign its value to the
CrawlDatum object; otherwise we use the default score described above.
+ * mergeJob - which combines multiple new entries for a given URL. This is
necessary, for example, if we attempt to inject the same URL twice within the
same seed list. In this job we again read the value of the
{{{db.score.injected}}} configuration property set in {{{nutch-site.xml}}},
which represents the score of new pages added by the injector. It is relevant
here because we must know whether a record already exists for the URL: if one
does, we wish to update it without overwriting its existing value.
* ./src/java/org/apache/nutch/crawl/CrawlDbReducer.java
* ./src/java/org/apache/nutch/crawl/Generator.java
* ./src/java/org/apache/nutch/fetcher/Fetcher.java
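
The Injector behaviour described above can be illustrated with a minimal, standalone Java sketch. It does not use any Nutch or Hadoop classes and is not the Injector's actual code: the class name InjectScoreSketch, its methods, and the in-memory map standing in for the CrawlDb are all hypothetical. It assumes the tab-separated seed format (URL followed by key=value metadata) and a default injected score of 1.0f, and it models "update but not overwrite" simply as keeping the score of an entry that already exists.

```java
import java.util.HashMap;
import java.util.Map;

public class InjectScoreSketch {
    // Assumed default, mirroring the 1.0f default described for db.score.injected.
    static final float DEFAULT_INJECTED_SCORE = 1.0f;
    // Metadata key that overrides the default score for a single seed URL.
    static final String SCORE_KEY = "nutch.score";

    /** Returns the URL part of a tab-separated seed line. */
    static String url(String seedLine) {
        return seedLine.split("\t")[0].trim();
    }

    /** Returns the custom score if nutch.score is set and parseable, else the default. */
    static float initialScore(String seedLine, float defaultScore) {
        String[] parts = seedLine.split("\t");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // not a key=value pair, ignore it
            }
            String key = parts[i].substring(0, eq).trim();
            String value = parts[i].substring(eq + 1).trim();
            if (SCORE_KEY.equals(key)) {
                try {
                    return Float.parseFloat(value);
                } catch (NumberFormatException e) {
                    return defaultScore; // malformed value: fall back to the default
                }
            }
        }
        return defaultScore;
    }

    /** Adds the seed to the map only if the URL has no entry yet (no overwrite). */
    static void merge(Map<String, Float> crawlDb, String seedLine, float defaultScore) {
        crawlDb.putIfAbsent(url(seedLine), initialScore(seedLine, defaultScore));
    }

    public static void main(String[] args) {
        Map<String, Float> crawlDb = new HashMap<>();
        String[] seeds = {
            "http://example.org/a",                    // no metadata: default score
            "http://example.org/b\tnutch.score=2.5",   // custom score
            "http://example.org/b\tnutch.score=9.0"    // duplicate URL: first entry kept
        };
        for (String seed : seeds) {
            merge(crawlDb, seed, DEFAULT_INJECTED_SCORE);
        }
        crawlDb.forEach((u, score) -> System.out.println(u + " -> " + score));
    }
}
```

In a real crawl the default would come from the db.score.injected property rather than a constant, and the configured ScoringFilters could adjust the score further through injectedScore.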
