Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch generate" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/bin/nutch%20generate?action=diff&rev1=5&rev2=6

Comment:
Add hint about number of reducers influencing topN, cf.  
http://mail-archives.apache.org/mod_mbox/nutch-user/201604.mbox/%[email protected]%3E

   nutch-site.xml<<BR>>
  
  ==== Configuration Values ====
-  The following properties directory affect how the Generator generates fetch 
segments.<<BR>><<BR>>
+  The following properties directly affect how the Generator generates fetch 
segments.<<BR>><<BR>>
   * generate.max.count: The maximum number of urls in a single fetchlist.  -1 
if unlimited. The urls are counted according to the value of the parameter 
generate.count.mode.
   
   * generate.count.mode: Determines how the URLs are counted for 
generate.max.count. Default value is 'host' but can be 'domain'. Note that we 
do not count per IP in the new version of the Generator.
+ 
+  * partition.url.mode: Determines how URLs are distributed over fetch 
partitions: "byHost" (default), "byDomain", or "byIP". Cf. the corresponding 
property "fetcher.queue.mode" in Fetcher used to guarantee delays between 
successive fetch requests to the same host/domain/IP.
+ 
+  Indirectly, the behavior of Generator is influenced by:<<BR>><<BR>>
+  * mapreduce.job.reduces: In a distributed environment (Hadoop) with multiple 
reducers the max. total number of URLs (-topN) is applied per reduce task as 
(topN/numReduceTasks). If URLs are not evenly spread over hosts (domains or 
IPs, see partition.url.mode) or belong to a single host/domain/IP, some 
partitions may be smaller than expected or even empty. The total number of 
generated URLs is then lower than topN.
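
The effect of the per-reducer limit described above can be illustrated with a
small toy model. This is only a sketch and not Nutch's actual Generator code:
the function name simulate_generate, the CRC32-based host partitioning, and the
example.com URLs are assumptions made for illustration. It assumes each reduce
task accepts at most topN/numReduceTasks URLs and that every URL of a host
lands in the same partition.

```python
from collections import Counter
from urllib.parse import urlparse
import zlib


def simulate_generate(urls, top_n, num_reducers):
    """Toy model: each partition accepts at most top_n // num_reducers URLs."""
    limit = top_n // num_reducers
    selected = Counter()
    for url in urls:
        host = urlparse(url).netloc
        # Hypothetical stand-in for the partitioner: a stable hash of the host,
        # so all URLs of one host end up in the same partition.
        part = zlib.crc32(host.encode("utf-8")) % num_reducers
        if selected[part] < limit:
            selected[part] += 1
    # Number of URLs selected in each partition
    return [selected[p] for p in range(num_reducers)]


# 1000 candidate URLs that all belong to a single host
single_host = [f"http://example.com/page{i}" for i in range(1000)]
per_partition = simulate_generate(single_host, top_n=100, num_reducers=4)
print(per_partition, sum(per_partition))
```

Under these assumptions, with topN=100 and 4 reducers, only one partition
receives URLs and it is capped at 25, so 25 URLs are generated in total instead
of 100. With a single reducer the same input yields the full 100.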
    
  ==== Examples ====
  
