Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch crawl" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/bin/nutch%20crawl

Comment:
change of url from last crawl page

New page:
Crawl is an alias for org.apache.nutch.crawl.Crawl

This class performs a complete crawl given a set of root URLs.

Usage: 
{{{
bin/nutch crawl <urlDir> [-solr <solrURL>] [-dir d] [-threads n] [-depth i] [-topN N]
}}}

'''<urlDir>''': A directory containing text files that list the URLs to crawl. 
The directory must already exist, for example ${NUTCH_HOME}/urls.
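
As a minimal sketch of preparing such a directory: the file name seed.txt and the 
URL below are placeholders chosen for illustration, and the fallback to the 
current directory when NUTCH_HOME is unset is an assumption of this example.

```shell
# Assumed location; the page only requires that the directory already exists.
URL_DIR="${NUTCH_HOME:-.}/urls"
mkdir -p "$URL_DIR"

# Seed list with one URL per line; seed.txt and the URL are placeholders.
printf '%s\n' 'http://nutch.apache.org/' > "$URL_DIR/seed.txt"
```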

'''[-solr <solrURL>]''': The URL of your Solr instance. Passing it here lets the 
crawl command index the crawled pages into Solr as part of the same run.

'''[-dir d]''': The directory in which Nutch should store the data produced by 
the crawl.

'''[-threads n]''': The number of threads Nutch should use when fetching.

'''[-depth i]''': How deep Nutch should crawl, counted in link levels from the 
root URLs. If you do not supply a value, Nutch defaults to 5. 
For example, with -depth 1 Nutch fetches only the first level (the root URLs 
themselves). With -depth 2 or more, Nutch also follows outlinks, fetching that 
many levels in total.

'''[-topN N]''': The maximum number of pages Nutch will fetch at each level of 
the crawl. The N top-scoring URLs are selected at each level.
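
Putting the options together, a full invocation might look like the sketch 
below. Every value here (the Solr URL, the crawl directory name, and the thread, 
depth and topN numbers) is an assumption picked for illustration, not a 
recommended setting. The script only prints the command, so it can be tried 
without a Nutch installation.

```shell
# All values are illustrative assumptions, not defaults taken from this page.
URL_DIR="${NUTCH_HOME:-.}/urls"          # existing directory of seed URL files
SOLR_URL="http://localhost:8983/solr/"   # assumed local Solr instance
CRAWL_DIR="crawl"                        # hypothetical directory for crawl data
THREADS=10
DEPTH=3
TOPN=50

# Assemble the command in the same option order as the usage line.
CRAWL_CMD="bin/nutch crawl $URL_DIR -solr $SOLR_URL -dir $CRAWL_DIR -threads $THREADS -depth $DEPTH -topN $TOPN"

# Dry run: print the command instead of executing it.
echo "$CRAWL_CMD"
```

To start the crawl, run the printed command from ${NUTCH_HOME}.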

CommandLineOptions
