[Nutch Wiki] Update of "NutchScoring" by ArthurCinader

Apache Wiki Fri, 10 Oct 2014 21:38:11 -0700

The "NutchScoring" page has been changed by ArthurCinader:
https://wiki.apache.org/nutch/NutchScoring?action=diff&rev1=12&rev2=13

Comment:
Rough edit for readability.  Not intended to alter the content.

  = Nutch Scoring =
- This page is dedicated to Scoring impementations within Apache Nutch.
+ This page is dedicated to Scoring implementations within Apache Nutch.
- The language used within this document is intended to reflect that used 
within thw Nutch community and vocabulary may vary from time to time, words may 
be used interchangably to refer to the same thing, etc. If you feel there is a 
discrepancy with this document then please let us know by 
[[http://nutch.apache.org/mailing_lists.html|contacting us]].
+ The language used within this document is intended to reflect that used 
within the Nutch community and vocabulary may vary from time to time, words may 
be used interchangeably to refer to the same thing, etc. If you feel there is a 
discrepancy with this document then please let us know by 
[[http://nutch.apache.org/mailing_lists.html|contacting us]].
  
  <<TableOfContents(4)>>
  
  == Introduction ==
- Amongst other things Apache Nutch is described as "...Being pluggable and 
modular... via extensible interfaces such as Parse, Index and ScoringFilter's 
for custom implementations". This document acts as a scoring 101 for Apache 
Nutch including information on, in particular 
+ Amongst other things Apache Nutch is pluggable and modular with extensible 
interfaces.  Parse, Index and ScoringFilter can all use custom implementations. 
This document explains the basics of scoring in Apache Nutch, including 
information on:
   * What Scoring is... what it means in Nutch.
   * Where Scoring takes place within the Nutch Crawl cycle.
   * Nutch Scoring extension points and how we can implement custom scoring 
algorithms.
@@ -23, +23 @@

  
  == Where Scoring takes place within the Nutch Crawl cycle ==
  Scoring occurs in numerous places throughout the Nutch codebase and 
consequently within the crawl cycle. This section describes the point of 
occurrence and the functional purpose Scoring serves at each step. The list 
below is structured to follow the logical and typical progression of a Nutch 
crawl cycle.
-  
+ 
   * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Injector.java|./src/java/org/apache/nutch/crawl/Injector.java]]
 - Scoring filters are defined within the various MapReduce job configurations. 
This means that the desired configuration will be used appropriately at runtime 
when the job is run by the JobClient. The Injector actually contains two 
MapReduce jobs, namely
-     * sortJob - where we set the InjectMapper as the Mapreduce Mapper 
override. The InjectMapper uses ScoringFilters to calculate a new initial score 
for a particular URL based on passing in the Hadoop Text key (representing the 
URL of the page) and associated CrawlDatum value (representing a new datum for 
which filters will modify it in-place) to the ScoringFilters.injectedScore 
method. Essentially this sets an initial score for newly injected pages. It 
should be noted that newly injected pages may have no inlinks, so filter 
implementations may wish to set this score to a non-zero value, to give newly 
injected pages some initial credit. We are concerned with the value for 
{{{db.score.injected}}} in this case as this assigns a default of 1.0f against 
the score of new pages added by the injector. This default score can however be 
overridden by associating the {{{nutch.score}}} metadata flag against any URL 
in a seed list. This allows to set a custom score for a specific URL. If this 
is the case we assign this score to the CrawlDatum object, if not then we use 
the default score as described above.
+     * sortJob - where we set the InjectMapper as the MapReduce Mapper 
override. The InjectMapper uses ScoringFilters to calculate a new initial score 
for a particular URL based on passing in the Hadoop Text key (representing the 
URL of the page) and associated CrawlDatum value (representing a new datum for 
which filters will modify it in-place) to the ScoringFilters.injectedScore 
method. Essentially this sets an initial score for newly injected pages. It 
should be noted that newly injected pages may have no inlinks, so filter 
implementations may wish to set this score to a non-zero value, to give newly 
injected pages some initial credit. We are concerned with the value for 
{{{db.score.injected}}} in this case as this assigns a default of {{{1.0f}}} 
against the score of new pages added by the injector. This default score can 
however be overridden by associating the {{{nutch.score}}} metadata flag 
against any URL in a seed list. This allows a custom score to be set for a 
specific URL. If this is the case we assign this score to the CrawlDatum 
object, if not then we use the default score as described above.
      * mergeJob - which combines multiple new entries for a given URL. An 
example of when this is necessary would be if we attempt to inject two 
identical URLs within the same seed list and where these should be merged into 
one record. In this job we are concerned with discovering the value for the 
{{{db.score.injected}}} configuration property present within 
{{{nutch-site.xml}}} as populated in the initial sortJob execution as described 
above. This value represents the score of new pages added by the Injector. This 
is relevant for us as we must know if a record already exists and we wish to 
update but not overwrite the value.
-  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java|./src/java/org/apache/nutch/crawl/Generator.java]]
 - ScoringFilters are used within the 
[[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/crawl/Generator.Selector.html|Generator.Selector]]
 class. This essentially selects URL entires due for Fetching and is the only 
functionality of the Genertor we need to cover within the context of this 
document. In addition to specifying the ScoringFilters within the MapReduce job 
configuration, we also use ScoringFilter functionality within the Map aspect of 
this job which selects and inverts a subset of URLs due for fetching. In 
particular we implement the {{{Generator.Selector.generatorSortValue}}} method 
which prepares a sort value for the purpose of sorting and selecting top N 
scoring pages during fetchlist generation. We pass in arguments for Hadoop Text 
key {{{url}}} (representing the url of the page we are trying to score), Nutch 
CrawlDatum value {{{datum}}} which represents the page's datum which should not 
be modified in this task) and an initial sort value {{{initSort}}} of 1.0f. It 
should be noted that the final value doesn't always need to be set to 1.0f as 
it can be linked to a value from previous filters in chain of Scoring 
implementations. The result of executing the 
{{{Generator.Selector.generatorSortValue}}} function is subsequently used to 
consider only entries with a score superior to the threshold which should then 
be fetched. 
+  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java|./src/java/org/apache/nutch/crawl/Generator.java]]
 - ScoringFilters are used within the 
[[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/crawl/Generator.Selector.html|Generator.Selector]]
 class, which selects URL entries due for fetching. In addition to specifying the 
ScoringFilters within the MapReduce job configuration, ScoringFilter is used 
within the Map phase of this job which selects and inverts a subset of URLs due 
for fetching. In particular we implement the 
{{{Generator.Selector.generatorSortValue}}} method which prepares a sort value 
for sorting and selecting the top N scoring pages during fetchlist generation. 
The arguments are: a Hadoop Text key {{{url}}} representing the URL of the 
page we are trying to score, a CrawlDatum {{{datum}}} representing the page's 
datum, which should not be modified in this task, and an initial sort value 
{{{initSort}}} of {{{1.0f}}}. It should be noted that the final value doesn't 
always need to be set to {{{1.0f}}} as it can be linked to a value from 
previous filters in the chain of Scoring implementations. The result of 
executing the {{{Generator.Selector.generatorSortValue}}} function is 
subsequently used to consider only entries with a score superior to the 
threshold.  Only URL entries above the threshold will then be fetched.
-  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java|./src/java/org/apache/nutch/fetcher/Fetcher.java]]
 - 
+  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java|./src/java/org/apache/nutch/fetcher/Fetcher.java]]
 -
   * ./src/java/org/apache/nutch/crawl/CrawlDbReducer.java
   * ./src/java/org/apache/nutch/indexer/IndexerMapReduce.java
   * ./src/java/org/apache/nutch/parse/ParseOutputFormat.java
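
The Injector and Generator steps described above can be illustrated with a small, self-contained sketch. It does not use the Nutch or Hadoop APIs: the class {{{ScoringSketch}}} and the methods {{{injectedScore}}} and {{{selectTopN}}} are hypothetical names chosen for illustration, and the tab-separated {{{nutch.score=<value>}}} seed-line layout is an assumption about the seed list format. It only mirrors the logic as described: a default injected score of {{{1.0f}}} that a per-URL {{{nutch.score}}} value can override, followed by selection of the top N entries whose sort value is above a threshold.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: none of these names are Nutch classes or methods. */
final class ScoringSketch {

    /** Stand-in for the db.score.injected default described above. */
    static final float DEFAULT_INJECTED_SCORE = 1.0f;

    /**
     * Returns the initial score for one seed-list line. A tab-separated
     * "nutch.score=<value>" field (assumed layout) overrides the default.
     */
    static float injectedScore(String seedLine, float defaultScore) {
        String prefix = "nutch.score=";
        String[] fields = seedLine.split("\t");
        for (int i = 1; i < fields.length; i++) { // fields[0] is the URL
            String field = fields[i].trim();
            if (field.startsWith(prefix)) {
                try {
                    return Float.parseFloat(field.substring(prefix.length()));
                } catch (NumberFormatException e) {
                    return defaultScore; // unparsable override: keep the default
                }
            }
        }
        return defaultScore;
    }

    /**
     * Keeps entries whose sort value is strictly above the threshold, orders
     * them by descending sort value, and returns at most topN URLs.
     */
    static List<String> selectTopN(Map<String, Float> sortValues, float threshold, int topN) {
        List<Map.Entry<String, Float>> kept = new ArrayList<>();
        for (Map.Entry<String, Float> entry : sortValues.entrySet()) {
            if (entry.getValue() > threshold) {
                kept.add(entry);
            }
        }
        kept.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < Math.min(topN, kept.size()); i++) {
            urls.add(kept.get(i).getKey());
        }
        return urls;
    }

    public static void main(String[] args) {
        String[] seeds = {
            "http://example.com/a\tnutch.score=2.5",
            "http://example.com/b",
            "http://example.com/c\tnutch.score=0.2"
        };
        Map<String, Float> sortValues = new LinkedHashMap<>();
        for (String seed : seeds) {
            sortValues.put(seed.split("\t")[0], injectedScore(seed, DEFAULT_INJECTED_SCORE));
        }
        // Select up to two URLs whose score is above 0.5.
        System.out.println(selectTopN(sortValues, 0.5f, 2));
    }
}
```

In Nutch itself these decisions are made by the configured ScoringFilters plugins inside the MapReduce jobs, so the sketch should be read as a model of the behaviour rather than as the actual implementation.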
