[Nutch Wiki] Update of "Nutch2Crawling" by FerdyGalema

Apache Wiki Wed, 27 Jun 2012 00:58:40 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "Nutch2Crawling" page has been changed by FerdyGalema:
http://wiki.apache.org/nutch/Nutch2Crawling?action=diff&rev1=2&rev2=3

Comment:
fix typo

   * GeneratorJob
   * FetcherJob
   * ParserJob (optionally done during fetch using 'fetcher.parse')
-  * DbUpdateJob
+  * DbUpdaterJob
  To populate initial rows for the webtable you can use the InjectorJob.
  
  There is a single table '''webpage''' that is the input and output for these 
jobs. Every row in this table is a URL (WebPage). To group URLs from the same 
TLD and domain closely together, the row key is stored as the URL with '''reversed 
host components'''. This takes advantage of the fact that row keys are sorted 
(in most NoSQL stores). Scanning over a subset is generally a lot faster than 
scanning over the entire table with specific rowkey filtering. See the 
following example rowkey listing:
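
As a rough illustration of why reversed host components help, the sketch below builds keys from a few made-up URLs and keeps them in a sorted set, which stands in for a sorted NoSQL table. The class name RowKeySketch, the helpers reverseHost, toRowKey and scanPrefix, and the key layout (reversed host followed by the path) are assumptions made for this example; they are not taken from the Nutch source, and the actual key format may differ.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

public class RowKeySketch {

    // Hypothetical helper: "www.example.com" -> "com.example.www".
    static String reverseHost(String host) {
        List<String> parts = new ArrayList<>(Arrays.asList(host.split("\\.")));
        Collections.reverse(parts);
        return String.join(".", parts);
    }

    // Hypothetical key layout: reversed host followed by the raw path.
    static String toRowKey(String url) {
        URI uri = URI.create(url);
        String path = uri.getRawPath() == null ? "" : uri.getRawPath();
        return reverseHost(uri.getHost()) + path;
    }

    // Range scan over the sorted keys: everything that starts with the prefix.
    // Note this is a plain string prefix, so "org.apache" would also match
    // a key such as "org.apachefoo/".
    static SortedSet<String> scanPrefix(TreeSet<String> keys, String prefix) {
        return keys.subSet(prefix, prefix + Character.MAX_VALUE);
    }

    public static void main(String[] args) {
        // The TreeSet stands in for a store that keeps row keys sorted.
        TreeSet<String> keys = new TreeSet<>();
        for (String url : new String[] {
                "http://www.example.com/a",
                "http://apache.org/",
                "http://blog.example.com/b",
                "http://nutch.apache.org/"}) {
            keys.add(toRowKey(url));
        }

        // Keys from the same domain end up next to each other.
        keys.forEach(System.out::println);

        // A bounded scan touches only the rows of one domain.
        System.out.println(scanPrefix(keys, "com.example"));
    }
}
```

Because hosts that share a domain share a key prefix once reversed, one bounded range scan covers them, rather than a pass over the entire table with a filter on each row key.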
