[Nutch Wiki] Trivial Update of "bin/nutch_generate" by SebastianNagel

Apache Wiki Tue, 26 Apr 2016 04:48:08 -0700

The "bin/nutch_generate" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/nutch_generate?action=diff&rev1=13&rev2=14

Comment:
typo + same case for all occurrences of "URLs"

  Generate is an alias for org.apache.nutch.crawl.Generator
  
- This class generates a subset of a crawl db to fetch. This version allows us 
to generate fetchlists for several segments in one go. Unlike in the initial 
version (FetchListTool), the IP resolution is done ONLY on the entries which 
have been selected for fetching. The URLs are partitioned by IP, domain or host 
within a segment. We can choose separately how to count the URLS i.e. by domain 
or host to limit the entries.
+ This class generates a subset of a crawl db to fetch. This version allows us 
to generate fetchlists for several segments in one go. Unlike in the initial 
version (FetchListTool), the IP resolution is done ONLY on the entries which 
have been selected for fetching. The URLs are partitioned by IP, domain or host 
within a segment. We can choose separately how to count the URLs i.e. by domain 
or host to limit the entries.
  
  {{{
  Usage: bin/nutch generate <crawldb> <segments_dir> [-force] [-topN N] 
[-numFetchers numFetchers] [-adddays numDays] [-noFilter] 
[-noNorm] [-maxNumSegments num]
  }}}
- 
  '''<crawldb>''': Path to the location of our crawldb directory.
  
  '''<segments_dir>''': Path to the location of our segments directory where 
the Fetcher Segments are created.
  
- '''[-force]''': This arguement will force an update even if there appears to 
be a lock. /!\ : CAUTION: advised /!\
+ '''[-force]''': This argument will force an update even if there appears to 
be a lock. /!\ : CAUTION: advised /!\
  
- '''[-topN N]''': Where N is the number of top URLs to be selected. Normally, 
the "generate" command prepares a fetchlist out of all unfetched pages and the 
ones whose fetch interval has already expired. But if you use -topN, then 
instead of all unfetched urls you only get the N urls with the highest score, 
potentially the most interesting ones, which should be prioritized in fetching.
+ '''[-topN N]''': Where N is the number of top URLs to be selected. Normally, 
the "generate" command prepares a fetchlist out of all unfetched pages and the 
ones whose fetch interval has already expired. But if you use -topN, then 
instead of all unfetched URLs you only get the N URLs with the highest score, 
potentially the most interesting ones, which should be prioritized in fetching.
  
  '''[-numFetchers numFetchers]''': The number of fetch partitions. Default: 
the value of the configuration key mapred.map.tasks, which is 1 in local mode 
and possibly higher in deploy/distributed mode.
  
- '''[-adddays numDays]''': Adds numDays to the current time to facilitate 
crawling urls already fetched sooner than db.default.fetch.interval. Default: 0
+ '''[-adddays numDays]''': Adds numDays to the current time to facilitate 
crawling URLs already fetched sooner than db.default.fetch.interval. Default: 0
  
- '''[-noFilter]''': Whether to filter URLs or not is read from the 
crawl.generate.filter property in nutch-site.xml/nutch-default.xml 
configuration files. If the property is not found, the URLs are filtered. The 
same applies to normalisation. 
+ '''[-noFilter]''': Whether to filter URLs or not is read from the 
crawl.generate.filter property in nutch-site.xml/nutch-default.xml 
configuration files. If the property is not found, the URLs are filtered. The 
same applies to normalisation.
  
  '''[-noNorm]''': The exact same applies for normalisation parameter as does 
for the filtering option above.
  
  '''[-maxNumSegments num]''':
  
  === Configuration Files ===
+  . hadoop-default.xml<<BR>> hadoop-site.xml<<BR>> nutch-default.xml<<BR>> 
nutch-site.xml<<BR>>
-  hadoop-default.xml<<BR>>
-  hadoop-site.xml<<BR>>
-  nutch-default.xml<<BR>>
-  nutch-site.xml<<BR>>
  
  === Configuration Values ===
-  The following properties directly affect how the Generator generates fetch 
segments.<<BR>><<BR>>
+  . The following properties directly affect how the Generator generates 
fetch segments.<<BR>><<BR>>
-  * generate.max.count: The maximum number of urls in a single fetchlist.  -1 
if unlimited. The urls are counted according to the value of the parameter 
generate.count.mode.
+  * generate.max.count: The maximum number of URLs in a single fetchlist.  -1 
if unlimited. The URLs are counted according to the value of the parameter 
generate.count.mode.
-  
+ 
   * generate.count.mode: Determines how the URLs are counted for 
generate.max.count. Default value is 'host' but can be 'domain'. Note that we 
do not count per IP in the new version of the Generator.
-   
+ 
  === Examples ===
- 
  {{{
  bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
  }}}
-  This example will generate a fetch list that contains all URLs ready to be 
fetched from the crawldb. The crawldb is located at /my/crawldb and the 
generator will output the fetch list to /my/segments/yyyyMMddHHmmss.
+  . This example will generate a fetch list that contains all URLs ready to be 
fetched from the crawldb. The crawldb is located at /my/crawldb and the 
generator will output the fetch list to /my/segments/yyyyMMddHHmmss.
  
  {{{
  bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments -topN 100 
-adddays 20
  }}}
-  In this example the Generator will add 20 days to the current date/time when 
determining the top 100 scoring pages to fetch.
+  . In this example the Generator will add 20 days to the current date/time 
when determining the top 100 scoring pages to fetch.
- 
  
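The examples above do not exercise -maxNumSegments. As a rough sketch only, the hypothetical helper below (generate_cmd is not part of Nutch) prints a generate command line instead of running it, so the flags can be reviewed before a real crawl. It assumes the "generate" alias and the flag spellings from the usage string, and that -maxNumSegments limits how many segments one run produces, which the page itself leaves undocumented.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not part of Nutch): it only prints the
# command line, it never invokes bin/nutch.
generate_cmd() {
  crawldb=$1        # path to the crawldb directory
  segments_dir=$2   # directory under which segments are created
  top_n=$3          # value for -topN
  max_segments=$4   # value for -maxNumSegments (meaning assumed, see above)
  echo "bin/nutch generate $crawldb $segments_dir -topN $top_n -maxNumSegments $max_segments"
}

# Print the command for the paths used in the examples above.
generate_cmd /my/crawldb /my/segments 1000 4
```

Once the printed line looks right, it can be pasted into a shell from the Nutch installation directory.
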
  CommandLineOptions
