Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=134&rev2=135

  ==== How can I fetch pages that require Authentication? ====
  See the [[HttpAuthenticationSchemes]] wiki page.
  
+ ==== Speed of Fetching seems to decrease between crawl iterations... what's wrong? ====
+ 
+ A possible reason is that by default 'partition.url.mode' is set to 'byHost'. This is a reasonable setting, because the URL subsets handed to the fetcher threads in different map tasks should be disjoint, so that the same URL is not fetched twice from different machines.
+ 
+ Secondly, the default setting for 'generate.max.count' is -1 (unlimited). This means that the more URLs you collect, especially from the same host, the more URLs of that host end up in the same fetcher map task!
+ 
+ Because there is also a politeness policy (please do keep it!) that waits a delay of, say, 30 seconds between calls to the same server, map tasks containing many URLs for the same server are slowed down. Since the reduce step can only start once all fetcher map tasks are done, a single slow map task becomes a bottleneck for the overall job.
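+ 
+ For reference, this per-server politeness delay is itself configurable through the 'fetcher.server.delay' property. A minimal sketch (the 30-second value mirrors the figure above, not Nutch's shipped default; note that a Crawl-Delay in robots.txt can override it):
+ {{{
+ <property>
+   <name>fetcher.server.delay</name>
+   <value>30.0</value>
+   <description>The number of seconds the fetcher will delay between
+   successive requests to the same server.
+   </description>
+ </property>
+ }}}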
+ 
+ The following settings may solve your problem:
+ 
+ Map tasks should be split according to the host:
+ {{{
+ <property>
+   <name>partition.url.mode</name>
+   <value>byHost</value>
+   <description>Determines how to partition URLs. Default value is
+   'byHost', also takes 'byDomain' or 'byIP'.
+   </description>
+ </property>
+ }}}
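+ 
+ It is usually advisable to keep this partitioning key aligned with the generator's counting key ('generate.count.mode', shown further below), so that fetch lists are capped and partitioned on the same unit.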
+ 
+ Don't put more than 10000 entries in a single fetch list:
+ {{{
+ <property>
+   <name>generate.max.count</name>
+   <value>10000</value>
+   <description>The maximum number of urls in a single
+   fetchlist.  -1 if unlimited. The urls are counted according
+   to the value of the parameter generate.count.mode.
+   </description>
+ </property>
+ }}}
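+ 
+ The counting mode referenced in that description is set by a companion property. A minimal sketch (check nutch-default.xml for the exact accepted values in your version; counting per host keeps this aligned with 'byHost' partitioning):
+ {{{
+ <property>
+   <name>generate.count.mode</name>
+   <value>host</value>
+   <description>Determines how the urls are counted for
+   generate.max.count, e.g. per host or per domain.
+   </description>
+ </property>
+ }}}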
+ 
+ Cap the crawl delay the fetcher will honor from a server's robots.txt (pages demanding more are skipped):
+ {{{
+ <property>
+  <name>fetcher.max.crawl.delay</name>
+  <value>10</value>
+  <description>
+  If the Crawl-Delay in robots.txt is set to greater than this value (in
+  seconds) then the fetcher will skip this page, generating an error report.
+  If set to -1 the fetcher will never skip such pages and will wait the
+  amount of time retrieved from robots.txt Crawl-Delay, however long that
+  might be.
+  </description>
+ </property>
+ }}}
  === Updating ===
  ==== Isn't there redundant/wasteful duplication between nutch crawldb and solr index? ====
  Nutch maintains a crawldb (and linkdb, for that matter) of the URLs it crawled, their fetch status, and the date. This data is kept beyond the fetch so that pages can be re-crawled after the re-crawl period has elapsed. At the same time, Solr maintains an inverted index of all the fetched pages. Wouldn't it be more efficient if Nutch relied on that index instead of maintaining its own crawldb, rather than storing the same URL twice? The problem we face here is what Nutch would do if we wished to change the Solr core to which we index.
