[Nutch Wiki] Update of "bin/nutch generate" by kiranchitturi

Apache Wiki Wed, 20 Mar 2013 11:08:24 -0700

The "bin/nutch generate" page has been changed by kiranchitturi:
http://wiki.apache.org/nutch/bin/nutch%20generate

New page:
Generate is an alias for org.apache.nutch.crawl.Generator

This class generates a subset of a crawl db to fetch. This version allows us to 
generate fetchlists for several segments in one go. Unlike in the initial 
version (FetchListTool), the IP resolution is done ONLY on the entries which 
have been selected for fetching. The URLs are partitioned by IP, domain or host 
within a segment. We can choose separately how to count the URLs, i.e. by 
domain or host, to limit the entries.

{{{
Usage: bin/nutch generate <crawldb> <segments_dir> [-force] [-topN N] 
[-numFetchers numFetchers] [-adddays numDays] [-noFilter] 
[-noNorm] [-maxNumSegments num]
}}}

'''<crawldb>''': Path to the location of our crawldb directory.

'''<segments_dir>''': Path to the location of our segments directory where the 
Fetcher Segments are created.

'''[-force]''': This argument forces an update even if there appears to be 
a lock. /!\ CAUTION advised /!\

'''[-topN N]''': Where N is the number of top URLs to be selected. Normally, 
the "generate" command prepares a fetchlist out of all unfetched pages and 
those whose fetch interval has already expired. If you use -topN, you instead 
get only the N URLs with the highest score: potentially the most interesting 
ones, which should be prioritized in fetching.
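
The effect of -topN can be illustrated outside Nutch with a short shell sketch. It is only an analogy: the Generator does its selection inside its own job, and the helper name `top_n` and the "score URL" input layout are assumptions made for this example.

```shell
# Illustration only: keep the N highest-scoring entries from "score URL" lines.
# top_n is a hypothetical helper, not part of Nutch.
top_n() {
  # Sort numerically on the first field, highest score first, then keep N lines.
  LC_ALL=C sort -k1,1nr | head -n "$1"
}

# Three candidate URLs with made-up scores; only the best two survive.
printf '%s\n' \
  '0.5 http://example.org/a' \
  '2.0 http://example.org/b' \
  '1.0 http://example.org/c' | top_n 2
```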

'''[-numFetchers numFetchers]''': The number of fetch partitions. Default: the 
value of the configuration key mapred.map.tasks, which is 1 in local mode and 
possibly more in deploy/distributed mode.

'''[-adddays numDays]''': Adds <numDays> to the current time to facilitate 
crawling URLs already fetched sooner than db.default.fetch.interval. Default: 0

'''[-noFilter]''': Skips URL filtering for this run. Otherwise, whether to 
filter URLs is read from the crawl.generate.filter property in the 
nutch-site.xml/nutch-default.xml configuration files. If the property is not 
found, the URLs are filtered.

'''[-noNorm]''': The same applies to the normalisation parameter as to the 
filtering option above.

'''[-maxNumSegments num]''': The maximum number of segments to generate in one 
go.

=== Configuration Files ===
 hadoop-default.xml<<BR>>
 hadoop-site.xml<<BR>>
 nutch-default.xml<<BR>>
 nutch-site.xml<<BR>>

=== Configuration Values ===
 The following properties directly affect how the Generator generates fetch 
segments.<<BR>><<BR>>
 * generate.max.count: The maximum number of URLs in a single fetchlist. -1 if 
unlimited. The URLs are counted according to the value of the parameter 
generate.count.mode.
 
 * generate.count.mode: Determines how the URLs are counted for 
generate.max.count. Default value is 'host' but can be 'domain'. Note that we 
do not count per IP in the new version of the Generator.
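
To make the interplay of the two properties concrete, here is a rough shell sketch of a per-host cap. It assumes the count is applied per host (the default count mode) and that a negative maximum means unlimited; `cap_per_host` is a hypothetical helper for illustration, not how the Generator is implemented.

```shell
# Illustration only: keep at most MAX URLs per host from a URL list on stdin.
# cap_per_host is a hypothetical helper, not part of Nutch.
cap_per_host() {
  awk -F/ -v max="$1" '
    max < 0 { print; next }            # -1 (any negative value) means unlimited
    { if (++seen[$3] <= max) print }   # $3 is the host in scheme://host/path
  '
}

# With a cap of 2, the third URL on a.example.org is dropped.
printf '%s\n' \
  'http://a.example.org/1' \
  'http://a.example.org/2' \
  'http://a.example.org/3' \
  'http://b.example.org/1' | cap_per_host 2
```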
  
=== Examples ===

{{{
bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
}}}
 This example will generate a fetch list that contains all URLs ready to be 
fetched from the crawldb. The crawldb is located at /my/crawldb and the 
generator will output the fetch list to /my/segments/yyyyMMddHHmmss.
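
Because segment directories are named by timestamp, the most recent one can be found by sorting the names. The following is a minimal sketch under that assumption; `latest_segment` is a hypothetical helper, and /my/segments is the example path used above.

```shell
# Illustration only: print the newest timestamp-named segment under a directory.
# latest_segment is a hypothetical helper, not part of Nutch.
latest_segment() {
  # yyyyMMddHHmmss names sort chronologically, so the last one is the newest.
  ls -d "$1"/[0-9]* 2>/dev/null | sort | tail -n 1
}

segment=$(latest_segment /my/segments)
echo "newest segment: ${segment:-none found}"
```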

{{{
bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments -topN 100 
-adddays 20
}}}
 In this example the Generator will add 20 days to the current date/time when 
determining the top 100 scoring pages to fetch.
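
The date shift can be sketched as simple arithmetic, assuming -adddays moves the reference time forward by whole days and a URL counts as due once its next fetch time is no later than that shifted time. Both helper names are hypothetical and times are in epoch seconds.

```shell
# Illustration only: helpers are hypothetical, not part of Nutch.

# shifted_time NOW NUM_DAYS: reference time moved forward by NUM_DAYS days.
shifted_time() {
  echo $(( $1 + $2 * 86400 ))
}

# is_due NEXT_FETCH NOW NUM_DAYS: succeeds if the URL is due at the shifted time.
is_due() {
  [ "$1" -le "$(shifted_time "$2" "$3")" ]
}

# Reference time the example above would use, 20 days from now.
shifted_time "$(date +%s)" 20
```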


CommandLineOptions
