Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch generate" page has been changed by TejasPatil:
http://wiki.apache.org/nutch/bin/nutch%20generate?action=diff&rev1=1&rev2=2

Comment:
added the usage for generate in 2.x

  
  This class generates a subset of a crawl db to fetch. This version allows us to generate fetchlists for several segments in one go. Unlike in the initial version (FetchListTool), IP resolution is done ONLY on the entries which have been selected for fetching. The URLs are partitioned by IP, domain or host within a segment. We can choose separately how to count the URLs, i.e. by domain or host, to limit the entries.
  
+ === Nutch 1.x ===
  {{{
  Usage: bin/nutch generate <crawldb> <segments_dir> [-force] [-topN N] [-numFetchers numFetchers] [-adddays numDays] [-noFilter] [-noNorm] [-maxNumSegments num]
  }}}
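  For example, to generate fetchlists for several segments in one go, as described above (the paths and numbers here are illustrative, not recommendations):
  {{{
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000 -maxNumSegments 4
  }}}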
@@ -26, +27 @@

  
  '''[-maxNumSegments num]''': the maximum number of segments to generate in a single run.
  
- === Configuration Files ===
+ ==== Configuration Files ====
   hadoop-default.xml<<BR>>
   hadoop-site.xml<<BR>>
   nutch-default.xml<<BR>>
   nutch-site.xml<<BR>>
  
- === Configuration Values ===
+ ==== Configuration Values ====
   The following properties directly affect how the Generator generates fetch segments.<<BR>><<BR>>
   * generate.max.count: The maximum number of URLs in a single fetchlist, or -1 if unlimited. The URLs are counted according to the value of the parameter generate.count.mode.
   
   * generate.count.mode: Determines how the URLs are counted for generate.max.count. Default value is 'host' but can be 'domain'. Note that we do not count per IP in the new version of the Generator.
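  As a minimal sketch, these two properties could be set together in conf/nutch-site.xml (the values shown are illustrative, not recommendations):
  {{{
  <!-- Cap each fetchlist at 500 URLs, counted per domain (illustrative values) -->
  <property>
    <name>generate.max.count</name>
    <value>500</value>
  </property>
  <property>
    <name>generate.count.mode</name>
    <value>domain</value>
  </property>
  }}}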
    
- === Examples ===
+ ==== Examples ====
  
  {{{
  bin/nutch org.apache.nutch.crawl.Generator /my/crawldb /my/segments
@@ -50, +51 @@

  }}}
   In this example the Generator will add 20 days to the current date/time when 
determining the top 100 scoring pages to fetch.
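  A sketch of a command line matching that description (the crawldb and segments paths are illustrative):
  {{{
  bin/nutch generate /my/crawldb /my/segments -topN 100 -adddays 20
  }}}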
  
+ === Nutch 2.x ===
+ {{{
+ Usage: GeneratorJob [-topN N] [-crawlId id] [-noFilter] [-noNorm] [-adddays numDays]
+     -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE
+     -crawlId <id>  - the id to prefix the schemas to operate on (default: storage.crawl.id)
+     -noFilter      - do not activate the filter plugin to filter the url, default is true
+     -noNorm        - do not activate the normalizer plugin to normalize the url, default is true
+     -adddays       - Adds numDays to the current time to facilitate crawling urls already
+                      fetched sooner than db.default.fetch.interval. Default value is 0.
+ }}}
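+ As a minimal sketch, assuming bin/nutch generate dispatches to GeneratorJob in 2.x, an invocation using these options might look like this (the crawl id "webcrawl" is illustrative):
+ {{{
+ bin/nutch generate -topN 1000 -crawlId webcrawl
+ }}}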
  
  CommandLineOptions
  
