Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "bin/crawl" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/crawl?action=diff&rev1=2&rev2=3

Comment:
Update to recent version (1.15) of bin/crawl

  = Usage =
  == Nutch 1.X ==
  {{{
- Usage: crawl [-i|--index] [-D "key=value"] <Seed Dir> <Crawl Dir> <Num Rounds>
+ Usage: crawl [options] <crawl_dir> <num_rounds>
+
+ Arguments:
+   <crawl_dir>                            Directory where the crawl/host/link/segments dirs are saved
+   <num_rounds>                           The number of rounds to run this crawl for
+
+ Options:
-       -i|--index      Indexes crawl results into a configured indexer
+   -i|--index                             Indexes crawl results into a configured indexer
-       -D              A Java property to pass to Nutch calls
+   -D                                     A Java property to pass to Nutch calls
-       Seed Dir        Directory in which to look for a seeds file
-       Crawl Dir       Directory where the crawl/link/segments dirs are saved
-       Num Rounds      The number of rounds to run this crawl for
-  Example: bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ urls/ TestCrawl/ 2
+   -w|--wait <NUMBER[SUFFIX]>             Time to wait before generating a new segment when no URLs
+                                          are scheduled for fetching. Suffix can be: s for second,
+                                          m for minute, h for hour and d for day. If no suffix is
+                                          specified second is used by default.
+                                          [default: -1]
+   -s <seed_dir>                          Path to seeds file(s)
+   -sm <sitemap_dir>                      Path to sitemap URL file(s)
+   --hostdbupdate                         Boolean flag showing if we either update or not update hostdb for each round
+   --hostdbgenerate                       Boolean flag showing if we use hostdb in generate or not
+   --num-slaves <num_slaves>              Number of slave nodes [default: 1]
+                                          Note: This can only be set when running in distribution mode
+   --num-tasks <num_tasks>                Number of reducer tasks [default: 2]
+   --size-fetchlist <size_fetchlist>      Number of URLs to fetch in one iteration [default: 50000]
+   --time-limit-fetch <time_limit_fetch>  Number of minutes allocated to the fetching [default: 180]
+   --num-threads <num_threads>            Number of threads for fetching / sitemap processing [default: 50]
+   --sitemaps-from-hostdb <frequency>     Whether and how often to process sitemaps based on HostDB.
+                                          Supported values are:
+                                            - never [default]
+                                            - always (processing takes place in every iteration)
+                                            - once (processing only takes place in the first iteration)
  }}}

  == Nutch 2.x ==
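
Note on the Nutch 1.X usage in the diff above: the old "Example:" line was removed and the seed directory is now passed with -s instead of as the first positional argument. Going only by the new usage text, the old example would presumably become the command shown in the comment below. The function wait_to_seconds is a hypothetical helper, not part of bin/crawl; it only illustrates the NUMBER[SUFFIX] rule described for -w|--wait (s, m, h, d, with seconds assumed when no suffix is given) and may differ from how the script parses the value.

```shell
#!/usr/bin/env bash
# Presumed equivalent of the removed example under the new syntax
# (assumption derived from the usage text, not taken from the wiki page):
#   bin/crawl -i -D solr.server.url=http://localhost:8983/solr/ -s urls/ TestCrawl/ 2

# Hypothetical helper: convert a NUMBER[SUFFIX] value to seconds.
wait_to_seconds() {
  local value="$1"
  local number="${value%[smhd]}"      # drop one trailing suffix letter, if any
  local suffix="${value#"$number"}"   # the letter that was dropped (may be empty)

  # Accept only an optionally negative integer, so the default -1 passes through.
  [[ "$number" =~ ^-?[0-9]+$ ]] || return 1

  case "$suffix" in
    ""|s) echo "$number" ;;            # no suffix means seconds
    m)    echo $(( number * 60 )) ;;
    h)    echo $(( number * 3600 )) ;;
    d)    echo $(( number * 86400 )) ;;
  esac
}
```

For instance, wait_to_seconds 30m yields 1800, and the documented default -1 is returned unchanged.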

