Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by JeffRitchie:
http://wiki.apache.org/nutch/nutch-0%2e8-dev/bin/nutch_crawl

------------------------------------------------------------------------------
  
  == Perform complete crawling and indexing given a set of root urls. ==
  
+ '''Configuration Files Used:''' 
+  hadoop-default.xml[[BR]]
+  hadoop-site.xml[[BR]]
+  crawl-tool.xml[[BR]]
+ 
  '''Usage:''' nutch-0.8-dev/bin/nutch org.apache.nutch.crawl.Crawl <urlDir> [-dir d] [-threads n] [-depth i] [-topN N]
  
- '''<urlDir>:''' contains text files with URL lists. This must be an existing 
directory.
+ '''<urlDir>:''' Directory containing text files with URL lists. This must be an existing directory.  Default Value: ''None''
  
+ '''[-dir <d>]:''' The directory where Nutch will save the crawl files.  
Default Value: ''./crawl-[date]'' where [date] is the current date.
- '''[-dir d]:''' You can choose the directory, where Nutch should save the 
index.
- If you don’t choose a directory Nutch would create a own directory in the 
directory where you started the crawl.
- Example of a –dir parameter: -dir /usr/local/index/ 
  
- '''[-threads n]:''' ''<need description>''
+ '''[-threads <n>]:''' Number of fetcher threads to use.  Overrides the configuration key ''fetcher.threads.fetch''.  Default Value: ''10''
  
+ '''[-depth <i>]:''' Number of iterations Nutch should crawl. Default Value: 
''5''
- '''[-depth i]:''' You can tell Nutch how deep it should crawl. If you don’t 
tell Nutch a value, it takes 3 as his standard parameter. 
- For example if you say –depth 1, Nutch would only index the first level. 
Only if you say –depth 2 (or more) Nutch would make a link follow.
  
- '''[-topN]:''' ''<need description>''
+ '''[-topN <num>]:''' Limit each crawl iteration to the top <num> links.  Default Value: ''Integer.MAX_VALUE''
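
As an illustration of how the options above fit together on one command line, here is a minimal sketch. The helper name `build_crawl_cmd`, the `NUTCH_HOME` variable, the `urls` and `crawl-test` directory names, and the `-topN` value of 1000 are assumptions made for this example and do not come from the page. The helper only prints the command, so the flag layout can be checked without a Nutch installation.

```shell
# Assumed install location; adjust to wherever nutch-0.8-dev was unpacked.
NUTCH_HOME=${NUTCH_HOME:-nutch-0.8-dev}

# Hypothetical helper: prints the crawl command instead of running it.
# $1 = urlDir (required, must already exist when the command is really run)
# $2 = crawl directory for -dir
# $3 = fetcher threads for -threads
# $4 = number of iterations for -depth
# $5 = per-iteration link limit for -topN
build_crawl_cmd() {
  printf '%s/bin/nutch org.apache.nutch.crawl.Crawl %s -dir %s -threads %s -depth %s -topN %s\n' \
    "$NUTCH_HOME" "$1" "$2" "$3" "$4" "$5"
}

# Example: documented defaults for -threads (10) and -depth (5),
# with an explicit crawl directory and link limit.
build_crawl_cmd urls crawl-test 10 5 1000
```

Any of the bracketed options can be left off the real command line, in which case the default value listed for that option applies.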
  
  DevelopmentCommandLineOptions
  
