Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "bin/nutch_generate" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_generate?action=diff&rev1=7&rev2=8

Comment:
Update to reflect 1.3 API changes

- generate is an alias for org.apache.nutch.tools.!FetchListTool
+ Generate is an alias for org.apache.nutch.crawl.Generator

- The generate command is used to create a new fetchlist from the webdb which contains urls which can be fetched using the fetch tool.
+ This class generates a subset of a crawl db to fetch. This version allows us to generate fetchlists for several segments in one go. Unlike in the initial version (FetchListTool), the IP resolution is done ONLY on the entries which have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. We can choose separately how to count the URLs, i.e. by domain or host, to limit the entries.

- Usage: bin/nutch org.apache.nutch.tools.!FetchListTool (-local | -ndfs <namenode:port>)<<BR>> <db> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] <<BR>> [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
+ Usage: bin/nutch org.apache.nutch.crawl.Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

- Command line parameters:
+ '''<crawldb>''': Path to the location of our crawldb directory.

- '''-topN N''' where N is a number of pages.
+ '''<segments_dir>''': Path to the location of our segments directory where the fetcher segments are created.

- Normally, the "generate" command prepares a fetchlist out of
- all unfetched pages, or the ones where fetch interval already expired.
- But if you use -topN, then instead of all unfetched urls you only get N
- urls with the highest score - potentially the most interesting ones,
- which should be prioritized in fetching.
+ '''[-force]''': Forces the generate to run even if a lock file from a previous (possibly aborted) generate is still present on the crawldb.
+ 
+ '''[-topN N]''': Where N is the number of top URLs to be selected.
  Normally, the "generate" command prepares a fetchlist out of all unfetched pages, or the ones where the fetch interval has already expired. But if you use -topN, then instead of all unfetched urls you only get the N urls with the highest score - potentially the most interesting ones, which should be prioritized in fetching.
+ 
+ '''[-numFetchers numFetchers]''': The number of fetch partitions. Default: Configuration key -> mapred.map.tasks -> 1
+ 
+ '''[-adddays numDays]''': Adds <days> to the current time to facilitate crawling urls already fetched sooner than db.default.fetch.interval. Default: 0
+ 
+ '''[-noFilter]''': Whether or not to filter URLs is read from the crawl.generate.filter property in the nutch-site.xml/nutch-default.xml configuration files. If the property is not found, the URLs are filtered. The same applies to normalisation.
+ 
+ '''[-noNorm]''': Exactly the same applies for the normalisation parameter as for the filtering option above.
+ 
+ '''[-maxNumSegments num]''': The maximum number of segments to generate in one go.
+ 
+ === Configuration Files ===
+ hadoop-default.xml<<BR>>
+ hadoop-site.xml<<BR>>
+ nutch-default.xml<<BR>>
+ nutch-site.xml<<BR>>
+ 
+ === Configuration Values ===
+ The following properties directly affect how the Generator generates fetch segments.<<BR>><<BR>>
+  * generate.max.count: The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generate.count.mode.
+ 
+  * generate.count.mode: Determines how the URLs are counted for generate.max.count. The default value is 'host' but it can also be 'domain'. Note that we do not count per IP in the new version of the Generator.
+ 
+ === Examples ===
+ 
+ {{{
+ bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
+ }}}
+ This example will generate a fetch list that contains all URLs ready to be fetched from the crawldb. The crawldb is located at /my/crawldb and the Generator will output the fetch list to /my/segments/yyyyMMddHHmmss.
+ 
+ {{{
+ bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments -topN 100 -adddays 20
+ }}}
+ In this example the Generator will add 20 days to the current date/time when determining the top 100 scoring pages to fetch.
+ 
  CommandLineOptions

- - Juho Mäkinen -
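
The generate.max.count and generate.count.mode properties described above are set in nutch-site.xml using the standard Hadoop property syntax. As a sketch (the values 100 and 'domain' below are purely illustrative choices, not the defaults):
{{{
<!-- nutch-site.xml fragment: cap each fetchlist at 100 urls per domain.
     Illustrative values only; the defaults are -1 (unlimited) and 'host'. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
}}}
With these settings a single generated fetchlist will contain at most 100 URLs from any one domain, which helps keep a fetch round polite towards individual sites.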

