Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "bin/nutch_fetch" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/bin/nutch_fetch?action=diff&rev1=4&rev2=5

Comment:
Update to reflect Nutch 1.3 API

- fetch is an alias for org.apache.nutch.fetcher.Fetcher
+ Fetch is an alias for org.apache.nutch.fetcher.Fetcher
  
- The fetcher. Most of the work is done by plugins.
+ This fetcher uses a well-known model of one producer (a QueueFeeder) and many 
consumers (FetcherThread-s).
  
- Usage: bin/nutch org.apache.nutch.fetcher.Fetcher (-local | -ndfs 
<namenode:port>) [-logLevel level] [-noParsing] [-showThreadID] [-threads n] 
<dir>
+ QueueFeeder reads input fetchlists and populates a set of FetchItemQueue-s, 
which hold FetchItem-s that describe the items to be fetched. There are as many 
queues as there are unique hosts, but at any given time the total number of 
fetch items in all queues is less than a fixed number (currently set to a 
multiple of the number of threads).
+  
+ As items are consumed from the queues, the QueueFeeder continues to add new 
input items, so that their total count stays fixed (FetcherThread-s may also 
add new items to the queues, e.g. as a result of redirection) - until all input 
items are exhausted, at which point the number of items in the queues begins to 
decrease. When this number reaches 0, the fetcher finishes.
+ 
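The feeding behaviour described above can be sketched as a small
single-threaded simulation. This is only an illustration, not Nutch code: the
names FetchQueues, feed and run, and the "threads * multiplier" cap, are
assumptions made for the example.

```python
from collections import defaultdict, deque
from urllib.parse import urlparse


class FetchQueues:
    """One queue per host, plus a running total of queued items (hypothetical)."""

    def __init__(self, max_items):
        self.max_items = max_items
        self.queues = defaultdict(deque)
        self.total = 0

    def add(self, url):
        self.queues[urlparse(url).netloc].append(url)
        self.total += 1

    def take(self):
        # Return an item from the first non-empty host queue, or None.
        for queue in self.queues.values():
            if queue:
                self.total -= 1
                return queue.popleft()
        return None


def feed(queues, fetchlist):
    """Top up the queues until the cap is reached; False once input is exhausted."""
    while queues.total < queues.max_items:
        url = next(fetchlist, None)
        if url is None:
            return False
        queues.add(url)
    return True


def run(urls, threads=2, multiplier=2):
    """Alternate feeding and consuming; return fetched URLs and the peak queue size."""
    queues = FetchQueues(max_items=threads * multiplier)
    fetchlist = iter(urls)
    fetched, peak, more_input = [], 0, True
    while more_input or queues.total > 0:
        if more_input:
            more_input = feed(queues, fetchlist)
        peak = max(peak, queues.total)
        item = queues.take()
        if item is not None:
            fetched.append(item)
    return fetched, peak
```

In this sketch the total stays at the cap while input remains, then drains to
0, at which point the loop ends.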
+ This fetcher implementation handles per-host blocking itself, instead of 
delegating this work to protocol-specific plugins. Each per-host queue handles 
its own "politeness" settings, such as the maximum number of concurrent 
requests and the crawl delay between consecutive requests. It also keeps a list 
of requests in progress and the time the last request finished. As 
FetcherThread-s ask for new items to be fetched, a queue returns an eligible 
item, or null if for "politeness" reasons this host's queue is not yet ready.
+ 
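A per-host queue of this kind might look roughly like the sketch below. It is
a simplified model, not the FetchItemQueue source: the class name HostQueue,
its method names and its default values are all assumptions, and None stands
in for the null mentioned above.

```python
from collections import deque


class HostQueue:
    """Hypothetical per-host queue that enforces its own politeness rules."""

    def __init__(self, crawl_delay=1.0, max_concurrent=1):
        self.crawl_delay = crawl_delay        # seconds between requests
        self.max_concurrent = max_concurrent  # simultaneous requests allowed
        self.pending = deque()
        self.in_progress = set()
        self.last_finished = float("-inf")    # no request has finished yet

    def add(self, item):
        self.pending.append(item)

    def get_item(self, now):
        """Return the next eligible item, or None if the host is not ready."""
        if not self.pending:
            return None
        if len(self.in_progress) >= self.max_concurrent:
            return None  # too many requests already running for this host
        if now - self.last_finished < self.crawl_delay:
            return None  # crawl delay has not elapsed yet
        item = self.pending.popleft()
        self.in_progress.add(item)
        return item

    def finish(self, item, now):
        """Record that a request completed at time `now`."""
        self.in_progress.discard(item)
        self.last_finished = now
```

Passing the current time in as an argument, rather than reading a clock inside
the class, keeps the sketch easy to test.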
+ If there are still unfetched items in the queues, but none of the items are 
ready, FetcherThread-s will spin-wait until either some items become available, 
or a timeout is reached (at which point the Fetcher will abort, assuming the 
task is hung).
+ 
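The wait-or-abort behaviour can be illustrated with a short polling loop. This
is a sketch under assumptions: FetcherHung, next_item, and the interval and
timeout parameters are invented for the example and are not Nutch names or
settings.

```python
import time


class FetcherHung(Exception):
    """Raised when no item became ready before the timeout (hypothetical)."""


def next_item(poll, timeout, interval=0.5,
              clock=time.monotonic, sleep=time.sleep):
    """Call poll() until it returns an item, or raise once timeout has passed.

    poll is expected to return None while queues still hold items but none
    of them is ready to be fetched.
    """
    deadline = clock() + timeout
    while True:
        item = poll()
        if item is not None:
            return item
        if clock() >= deadline:
            raise FetcherHung("no fetch item became ready before the timeout")
        sleep(interval)
```

The clock and sleep parameters default to the real ones and exist only so the
loop can be exercised without actually waiting.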
+ {{{
+ Usage: bin/nutch fetch <segment> [-threads n] [-noParsing]
+ }}}
+ 
+ '''<segment>''': This is the path to the previously generated segment 
directory we wish to fetch.
+ 
+ '''[-threads n]''': This argument sets the number of threads that work 
concurrently on fetching URLs in the desired segment, i.e. the number of 
fetcher threads the fetcher should use. This also determines the maximum 
number of requests that are made at once (each fetcher thread handles one 
connection).
+ 
+ '''[-noParsing]''': Disables parsing during the fetch. This is the default 
even if no argument is passed, because errors can occur when parsing segments, 
and such errors can corrupt the results of the whole fetching process. Note 
that parsing will only follow meta-redirects coming from the original URL.
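
For scripted crawls, the usage line above can be assembled programmatically.
The helper below is a hypothetical convenience, not part of Nutch, and the
segment path in the usage note is a made-up example.

```python
def fetch_command(segment, threads=None, no_parsing=False):
    """Build the argument list for: bin/nutch fetch <segment> [-threads n] [-noParsing]."""
    cmd = ["bin/nutch", "fetch", segment]
    if threads is not None:
        if threads < 1:
            raise ValueError("threads must be a positive integer")
        cmd += ["-threads", str(threads)]
    if no_parsing:
        cmd.append("-noParsing")
    return cmd
```

For example, fetch_command("crawl/segments/20110601120000", threads=10) yields
a list that could be handed to subprocess.run from the Nutch installation
directory.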
  
  CommandLineOptions
  
