[Nutch Wiki] Trivial Update of "NewScoringIndexingExample" by LewisJohnMcgibbney

Apache Wiki Wed, 28 Oct 2015 13:17:58 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "NewScoringIndexingExample" page has been changed by LewisJohnMcgibbney:
https://wiki.apache.org/nutch/NewScoringIndexingExample?action=diff&rev1=6&rev2=7

  
  '''N.B.''' This page and the functionality described within are applicable 
only to Nutch 1.X.
  
+ <<TableOfContents(4)>>
+ 
+ = Introduction =
+ 
  Below is an example of running the new scoring and indexing systems from 
start to finish.  It was run on a sample of 1000 urls over two fetch cycles: 
the first fetched the 1000 urls and the second fetched the top 2000 urls.  The 
loops job is optional and is included only for completeness; in production we 
have removed it.  This was done with a clean pull from Nutch trunk as of 
2009-03-06 (right before the 1.0 release).  If you have problems running these 
commands, or have questions, email me at kubes at the apache address dot org or 
write to the nutch users or dev list and I will reply.
  
+ = Workflow =
  
  {{{
  bin/nutch inject crawl/crawldb crawl/urls/
@@ -18, +23 @@

  }}}
  
  One thing to point out here is that WebGraph is meant to be used on larger 
web crawls to create web graphs.  By default it ignores outlinks to pages in 
the same domain, including subdomains, and to pages with the same hostname.  It 
also limits outlinks to the same page or the same domain to one per page.  All 
of these behaviors can be changed through the following configuration options:
+ 
+ = Configuration =
  
  {{{
  
@@ -47, +54 @@

  </property> 
  
  }}}
+ 
+ = Additional WebGraph Classes =
  
  By default, however, if you are only crawling pages within a domain or within 
a set of subdomains, all outlinks will be ignored and you will end up with an 
empty webgraph.  This in turn will cause an error in the LinkRank job.  The 
flip side is that by NOT ignoring links to the same domain/host and by not 
limiting those links, the webgraph becomes much denser, so there are many more 
links to process, and those links probably won't affect relevancy as much.
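
The empty-webgraph case described above can be illustrated with a small sketch. This is not Nutch code: the function and parameter names below are hypothetical, the domain detection is deliberately naive (last two hostname labels), and the per-page outlink limits are not modeled. It only shows why a crawl confined to one domain keeps no edges when internal links are ignored.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercase hostname of a URL ('' if none)."""
    return (urlsplit(url).hostname or "").lower()


def domain_of(url):
    """Naive domain guess: the last two labels of the hostname.

    Assumption for illustration only; wrong for suffixes such as co.uk.
    """
    labels = host_of(url).split(".")
    return ".".join(labels[-2:])


def build_edges(outlinks, ignore_internal_host=True, ignore_internal_domain=True):
    """Return the (source, target) edges kept after the internal-link filters."""
    edges = []
    for source, targets in outlinks.items():
        for target in targets:
            if ignore_internal_host and host_of(source) == host_of(target):
                continue  # same hostname: dropped when the host filter is on
            if ignore_internal_domain and domain_of(source) == domain_of(target):
                continue  # same domain or subdomain: dropped when the domain filter is on
            edges.append((source, target))
    return edges


# Hypothetical crawl confined to one domain and its subdomains.
crawl = {
    "http://www.example.com/": [
        "http://www.example.com/about",
        "http://blog.example.com/",
    ],
    "http://blog.example.com/": ["http://www.example.com/"],
}

print(len(build_edges(crawl)))                # filters on: no edges survive
print(len(build_edges(crawl, False, False)))  # filters off: every link is kept
```

With both filters on, every link in this sample is internal and the edge list is empty, which corresponds to the empty webgraph case; with both off, all three links are kept.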
  
@@ -163, +172 @@

  bin/nutch org.apache.nutch.indexer.field.FieldIndexer -fields 
crawl/fields/basicfields/ -fields crawl/fields/anchorfields/ -output 
crawl/indexes
  }}}
  
+ = Class Diagram =
+
