[Nutch Wiki] Trivial Update of "NutchScoring" by LewisJohnMcgibbney

Apache Wiki Sun, 21 Sep 2014 13:18:40 -0700

The "NutchScoring" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NutchScoring?action=diff&rev1=11&rev2=12

  == What Scoring is... what it means in Nutch ==
   * Describe CrawlDatum data structure in Nutch trunk
  A scoring filter will manipulate scoring variables in CrawlDatum and in 
resulting search indexes. Filters can be chained in a specific order, to 
provide multi-stage scoring adjustments.
+  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilter.java|./src/java/org/apache/nutch/scoring/ScoringFilter.java]]
 - A scoring filter will manipulate scoring variables in CrawlDatum and in 
resulting search indexes. Filters can be chained in a specific order, to 
provide multi-stage scoring adjustments.
+  * ./src/java/org/apache/nutch/scoring/ScoringFilterException.java
+  * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/scoring/ScoringFilters.java|./src/java/org/apache/nutch/scoring/ScoringFilters.java]]
 - Create and cache ScoringFilter implementing plugins.
  
  == Where Scoring takes place within the Nutch Crawl cycle ==
  Scoring occurs in numerous places throughout the Nutch codebase and 
consequently within the crawl cycle. This section describes the point of 
occurrence and functional purpose Scoring serves at each step. You will see that 
the list of elements has been structured to represent the logical and typical 
progression of a Nutch crawl cycle.
@@ -27, +30 @@

   * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java|./src/java/org/apache/nutch/crawl/Generator.java]]
 - ScoringFilters are used within the 
[[http://nutch.apache.org/apidocs/apidocs-1.9/index.html?org/apache/nutch/crawl/Generator.Selector.html|Generator.Selector]]
 class. This essentially selects URL entries due for fetching and is the only 
functionality of the Generator we need to cover within the context of this 
document. In addition to specifying the ScoringFilters within the MapReduce job 
configuration, we also use ScoringFilter functionality within the Map aspect of 
this job which selects and inverts a subset of URLs due for fetching. In 
particular we implement the {{{Generator.Selector.generatorSortValue}}} method 
which prepares a sort value for the purpose of sorting and selecting top N 
scoring pages during fetchlist generation. We pass in arguments for Hadoop Text 
key {{{url}}} (representing the url of the page we are trying to score), Nutch 
CrawlDatum value {{{datum}}} (representing the page's datum, which should not 
be modified in this task) and an initial sort value {{{initSort}}} of 1.0f. It 
should be noted that the initial value doesn't always need to be set to 1.0f as 
it can be linked to a value from previous filters in the chain of Scoring 
implementations. The result of executing the 
{{{Generator.Selector.generatorSortValue}}} function is subsequently used to 
consider only entries with a score superior to the threshold which should then 
be fetched. 
   * 
[[https://svn.apache.org/repos/asf/nutch/trunk/src/java/org/apache/nutch/fetcher/Fetcher.java|./src/java/org/apache/nutch/fetcher/Fetcher.java]]
 - 
   * ./src/java/org/apache/nutch/crawl/CrawlDbReducer.java
-  * ./src/java/org/apache/nutch/fetcher/OldFetcher.java
   * ./src/java/org/apache/nutch/indexer/IndexerMapReduce.java
   * ./src/java/org/apache/nutch/parse/ParseOutputFormat.java
   * ./src/java/org/apache/nutch/parse/ParserChecker.java
   * ./src/java/org/apache/nutch/parse/ParseSegment.java
-  * ./src/java/org/apache/nutch/scoring/AbstractScoringFilter.java
-  * ./src/java/org/apache/nutch/scoring/ScoringFilter.java
-  * ./src/java/org/apache/nutch/scoring/ScoringFilterException.java
-  * ./src/java/org/apache/nutch/scoring/ScoringFilters.java
   * ./src/java/org/apache/nutch/tools/arc/ArcSegmentCreator.java
   * ./src/java/org/apache/nutch/tools/FreeGenerator.java
-  * 
./src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java
-  * 
./src/plugin/scoring-link/src/java/org/apache/nutch/scoring/link/LinkAnalysisScoringFilter.java
-  * 
./src/plugin/scoring-opic/src/java/org/apache/nutch/scoring/opic/OPICScoringFilter.java
-  * 
./src/plugin/tld/src/java/org/apache/nutch/scoring/tld/TLDScoringFilter.java
-  * 
./src/plugin/urlmeta/src/java/org/apache/nutch/scoring/urlmeta/URLMetaScoringFilter.java
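
The sort-value step described for Generator.Selector above can be sketched as follows. This is a minimal, self-contained illustration and not the Nutch source: {{{DatumSketch}}}, {{{SortValueFilterSketch}}} and {{{GeneratorSortSketch}}} are hypothetical stand-ins for CrawlDatum, a scoring filter and the selector, the URL is a plain String rather than a Hadoop Text key, and the rule that an entry is kept when its sort value is not below the threshold is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for CrawlDatum: only the score is modelled, and it is
// immutable because the datum should not be modified in this task.
final class DatumSketch {
    final float score;
    DatumSketch(float score) { this.score = score; }
}

// Hypothetical stand-in for the sort-value part of a scoring filter.
interface SortValueFilterSketch {
    float generatorSortValue(String url, DatumSketch datum, float initSort);
}

class GeneratorSortSketch {

    // Pass the sort value through every filter in order. Each filter receives
    // the value produced by the previous filter, starting from initSort.
    static float chainSortValue(List<SortValueFilterSketch> filters,
                                String url, DatumSketch datum, float initSort) {
        float sort = initSort;
        for (SortValueFilterSketch filter : filters) {
            sort = filter.generatorSortValue(url, datum, sort);
        }
        return sort;
    }

    // Keep only URLs whose chained sort value is not below the threshold
    // (assumed selection rule for this sketch).
    static List<String> select(List<SortValueFilterSketch> filters,
                               List<String> urls, List<DatumSketch> datums,
                               float threshold) {
        List<String> selected = new ArrayList<>();
        for (int i = 0; i < urls.size(); i++) {
            float sort = chainSortValue(filters, urls.get(i), datums.get(i), 1.0f);
            if (sort >= threshold) {
                selected.add(urls.get(i));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        // Two illustrative filters: scale by the stored score, then boost https URLs.
        SortValueFilterSketch byScore = (url, datum, initSort) -> datum.score * initSort;
        SortValueFilterSketch boostHttps = (url, datum, initSort) ->
                url.startsWith("https://") ? initSort * 2.0f : initSort;

        List<SortValueFilterSketch> chain = Arrays.asList(byScore, boostHttps);
        List<String> urls = Arrays.asList("https://a.example/", "http://b.example/");
        List<DatumSketch> datums = Arrays.asList(new DatumSketch(0.75f), new DatumSketch(0.5f));

        System.out.println(select(chain, urls, datums, 1.0f));
    }
}
```

The order of the filters in the list matters because each one sees the output of its predecessor, which is why the initial value handed to a given filter need not be 1.0f.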
  
  
  
  == Scoring extension points ==
- 
-  * ScoringFilter - A scoring filter will manipulate scoring variables in 
CrawlDatum and in resulting search indexes. Filters can be chained in a 
specific order, to provide multi-stage scoring adjustments.
-  * ScoringFilters - Creates and caches ScoringFilter implementing plugins.
  
  == Examples ==
   * NewScoring -- New stable pagerank like webgraph and link-analysis jobs.
