Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "bin/nutch_generate" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_generate?action=diff&rev1=7&rev2=8

Comment:
Update to reflect 1.3 API changes

- generate is an alias for org.apache.nutch.tools.!FetchListTool
+ Generate is an alias for org.apache.nutch.crawl.Generator

- The generate command is used to create a new fetchlist from the webdb which contains urls which can be fetched using the fetch tool.
+ This class generates a subset of a crawl db to fetch. This version allows us to generate fetchlists for several segments in one go. Unlike in the initial version (FetchListTool), the IP resolution is done ONLY on the entries which have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. We can choose separately how to count the URLs, i.e. by domain or host, to limit the entries.

- Usage: bin/nutch org.apache.nutch.tools.!FetchListTool (-local | -ndfs <namenode:port>)<<BR>> <db> <segment_dir> [-refetchonly] [-anchoroptimize linkdb] [-topN N] <<BR>> [-cutoff cutoffscore] [-numFetchers numFetchers] [-adddays numDays]
+ Usage: bin/nutch org.apache.nutch.crawl.Generator <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]

- Command line parameters:
+ '''<crawldb>''': Path to the location of our crawldb directory.

- '''-topN N''' where N is a number of pages.
+ '''<segments_dir>''': Path to the location of our segments directory where the fetcher segments are created.

- Normally, the "generate" command prepares a fetchlist out of
- all unfetched pages, or the ones where fetch interval already expired.
- But if you use -topN, then instead of all unfetched urls you only get N
- urls with the highest score - potentially the most interesting ones,
- which should be prioritized in fetching.
+ '''[-force]''': Forces the generate to run even if a lock file from a previous (possibly aborted) generate is still present on the crawldb.
+ 
+ '''[-topN N]''': Where N is the number of top URLs to be selected.
  Normally, the "generate" command prepares a fetchlist out of all unfetched pages, or the ones where the fetch interval has already expired. But if you use -topN, then instead of all unfetched urls you only get the N urls with the highest score - potentially the most interesting ones, which should be prioritized in fetching.
+ 
+ '''[-numFetchers numFetchers]''': The number of fetch partitions. Default: Configuration key -> mapred.map.tasks -> 1
+ 
+ '''[-adddays numDays]''': Adds <days> to the current time to facilitate crawling urls already fetched sooner than db.default.fetch.interval. Default: 0
+ 
+ '''[-noFilter]''': Whether or not to filter URLs is read from the crawl.generate.filter property in the nutch-site.xml/nutch-default.xml configuration files. If the property is not found, the URLs are filtered. The same applies to normalisation.
+ 
+ '''[-noNorm]''': Exactly the same applies for the normalisation parameter as for the filtering option above.
+ 
+ '''[-maxNumSegments num]''': The maximum number of segments to generate in one go.
+ 
+ === Configuration Files ===
+ hadoop-default.xml<<BR>>
+ hadoop-site.xml<<BR>>
+ nutch-default.xml<<BR>>
+ nutch-site.xml<<BR>>
+ 
+ === Configuration Values ===
+ The following properties directly affect how the Generator generates fetch segments.<<BR>><<BR>>
+  * generate.max.count: The maximum number of urls in a single fetchlist. -1 if unlimited. The urls are counted according to the value of the parameter generate.count.mode.
+ 
+  * generate.count.mode: Determines how the URLs are counted for generate.max.count. The default value is 'host' but it can also be 'domain'. Note that we do not count per IP in the new version of the Generator.
+ 
+ === Examples ===
+ 
+ {{{
+ bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
+ }}}
+ This example will generate a fetch list that contains all URLs ready to be fetched from the crawldb. The crawldb is located at /my/crawldb and the Generator will output the fetch list to /my/segments/yyyyMMddHHmmss.
+ 
+ {{{
+ bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments -topN 100 -adddays 20
+ }}}
+ In this example the Generator will add 20 days to the current date/time when determining the top 100 scoring pages to fetch.
+ 
  CommandLineOptions

- - Juho Mäkinen -
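
The generate.max.count and generate.count.mode properties described above are set in nutch-site.xml using the standard Hadoop property syntax. As a sketch (the values 100 and 'domain' below are purely illustrative choices, not the defaults):
{{{
<!-- nutch-site.xml fragment: cap each fetchlist at 100 urls per domain.
     Illustrative values only; the defaults are -1 (unlimited) and 'host'. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
</property>
<property>
  <name>generate.count.mode</name>
  <value>domain</value>
</property>
}}}
With these settings a single generated fetchlist will contain at most 100 URLs from any one domain, which helps keep a fetch round polite towards individual sites.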

