[Nutch Wiki] Update of "bin/nutch_updatedb" by LewisJohnMcgibbney

Apache Wiki Sat, 02 Jul 2011 00:18:12 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "bin/nutch_updatedb" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_updatedb?action=diff&rev1=4&rev2=5

Comment:
Update to reflect Nutch 1.3 API

- updatedb is an alias for org.apache.nutch.tools.!UpdateDatabaseTool
+ Updatedb is an alias for org.apache.nutch.crawl.CrawlDb
  
- This class takes the output of the fetcher and updates the page and link DBs 
accordingly. Eventually, as the database scales, this will broken into several 
phases, each consuming and emitting batch files, but, for now, we're doing it 
all here.
+ This class takes the output of the fetcher and updates the crawldb 
accordingly. 
  
- Usage: bin/nutch org.apache.nutch.tools.!UpdateDatabaseTool (-local | -ndfs 
<namenode:port>) [-max N] [-noAdditions] <db> <seg_dir> [ <seg_dir> ... ]
+ Usage: 
+ 
+ {{{
+ CrawlDb <crawldb> (-dir <segments> | <seg1> <seg2> ...) [-force] [-normalize] 
[-filter] [-noAdditions]
+ }}}
+ 
+ '''<crawldb>''': This is the path to the crawldb directory we wish to update.
+ 
+ '''-dir <segments>''': This should be the path to the parent directory 
containing the segments to update from (useful when there are several).
+ 
+ '''<seg1> <seg2> ...''': Alternatively, pass a list of paths to the 
individual segments to update from.
+ 
+ '''[-force]''': This argument forces an update even if the crawldb 
appears to be locked. /!\ Use with caution. /!\
+ 
+ '''[-normalize]''': This argument applies any current URLNormalizers to URLs 
in the crawldb and segment (usually not needed).
+ 
+ '''[-filter]''': Pass this argument to apply any current URLFilters to URLs in 
the crawldb and segment. This can provide better quality results in certain 
applications.
+ 
+ '''[-noAdditions]''': If this parameter is passed, the updatedb command will 
only update already existing URLs and will not add any URLs newly discovered 
during a fetch.
+ 
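
As a rough illustration of the usage above, the following sketch assembles an updatedb command line from the documented arguments. It only prints the command instead of running it, so it works without a Nutch installation. The paths `crawl/crawldb` and `crawl/segments` and the helper name `build_updatedb_cmd` are assumptions made for this example, not part of Nutch.

```shell
#!/bin/sh
# Hypothetical paths: adjust to your own crawl layout.
CRAWLDB="crawl/crawldb"
SEGMENTS="crawl/segments"

# Assemble an updatedb command line using the -dir form described above.
# $1: optional flags, e.g. "-filter -noAdditions" (may be empty).
# build_updatedb_cmd is an illustrative helper, not a Nutch command.
build_updatedb_cmd() {
    # sed strips the trailing space left behind when no flags are given.
    echo "bin/nutch updatedb $CRAWLDB -dir $SEGMENTS $1" | sed 's/ *$//'
}

# Print the commands rather than executing them.
build_updatedb_cmd ""
build_updatedb_cmd "-filter -noAdditions"
```

To actually run the update, the printed line would be executed from the Nutch installation directory. The second form applies the URL filters and skips newly discovered URLs, as described for `-filter` and `-noAdditions` above.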
  
  CommandLineOptions
