Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch_inject" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_inject?action=diff&rev1=5&rev2=6

Comment:
Update to reflect Nutch 1.3 API changes

- inject is an alias for org.apache.nutch.db.!WebDBInjector
+ Inject is an alias for org.apache.nutch.crawl.Injector
  
- This class takes a flat file of URLs and adds them as entries into a web page 
& link db. Useful for bootstrapping the system.
+ This class takes a flat file of URLs and adds them to the list of pages to be 
crawled. It is useful for bootstrapping the system. Each URL file contains one 
URL per line, optionally followed by custom metadata. Metadata fields are 
separated by tabs, and each key is separated from its value by '='.
  
- Usage: bin/nutch org.apache.nutch.db.!WebDBInjector (-local | -ndfs 
<namenode:port>) <db_dir> (-urlfile <url_file> | -dmozfile <dmoz_file>) 
[-subset <subsetDenominator>] [-includeAdultMaterial] [-skew skew] 
[-noDmozDesc] [-topicFile <topic list file>] [-topic <topic> [-topic <topic> 
[...]]]
+ Note that some metadata keys are reserved: 
+ 
+ ''nutch.score'': sets a custom score for a specific URL
+ 
+ ''nutch.fetchInterval'': sets a custom fetch interval for a specific URL
+ 
+ e.g. http://www.abc.org/ nutch.score=10 nutch.fetchInterval=2592000 
userType=open_source
+ 
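The seed line format described above can be illustrated with a minimal Python sketch. It is not Nutch's own parser; the function name `parse_seed_line` is hypothetical, and the sketch assumes only what the text states: fields are separated by tabs, and each metadata key is separated from its value by '='.

```python
def parse_seed_line(line):
    """Split one seed line into (url, metadata dict).

    Illustrative only: assumes tab-separated fields and key=value metadata,
    as described in the text. Values are kept as strings.
    """
    fields = line.rstrip("\n").split("\t")
    url = fields[0].strip()
    metadata = {}
    for field in fields[1:]:
        key, sep, value = field.partition("=")
        if not sep:
            continue  # no '=' present, so this is not a key=value pair
        metadata[key.strip()] = value.strip()
    return url, metadata


# The example line from the text, with explicit tab separators.
example = (
    "http://www.abc.org/\tnutch.score=10"
    "\tnutch.fetchInterval=2592000\tuserType=open_source"
)
url, metadata = parse_seed_line(example)
print(url)
print(metadata)
```

Here `nutch.score` and `nutch.fetchInterval` are the reserved keys named above, while `userType` is ordinary custom metadata.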
+ Usage: 
+ {{{
+ bin/nutch org.apache.nutch.crawl.Injector <crawldb> <url_dir>
+ }}}
+ 
+ '''<crawldb>''': The directory containing the crawldb
+ 
+ '''<url_dir>''': The directory containing the seed list (the 'flat file' 
described above), usually a text file containing one URL per line.
+ 
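As a rough sketch of how the two arguments fit together, the Python below writes a seed file into a URL directory and assembles the argument list for the usage line shown above. The helper names, the file name `seed.txt`, and the directory names are assumptions made for illustration. The command is only built, not executed.

```python
import tempfile
from pathlib import Path


def write_seed_file(url_dir, seeds):
    """Write one URL per line, with optional tab-separated key=value metadata."""
    url_dir = Path(url_dir)
    url_dir.mkdir(parents=True, exist_ok=True)
    seed_path = url_dir / "seed.txt"  # hypothetical file name
    lines = []
    for url, metadata in seeds:
        fields = [url] + [f"{key}={value}" for key, value in metadata.items()]
        lines.append("\t".join(fields))
    seed_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return seed_path


def build_inject_command(crawldb, url_dir):
    """Return the argument list matching the usage line; nothing is run."""
    return ["bin/nutch", "org.apache.nutch.crawl.Injector",
            str(crawldb), str(url_dir)]


with tempfile.TemporaryDirectory() as tmp:
    url_dir = Path(tmp) / "urls"
    seed_path = write_seed_file(
        url_dir,
        [("http://www.abc.org/", {"nutch.score": 10, "userType": "open_source"})],
    )
    print(seed_path.read_text(encoding="utf-8"), end="")
    print(" ".join(build_inject_command(Path(tmp) / "crawldb", url_dir)))
```

Returning the command as a list rather than a single string keeps each argument separate, which is the form `subprocess.run` expects if the command is later executed from a Nutch installation directory.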
  
  CommandLineOptions
  
