[Nutch Wiki] Update of "FAQ" by LewisJohnMcgibbney

Apache Wiki Mon, 18 Jul 2011 04:01:31 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=124&rev2=125

  See  [[HttpAuthenticationSchemes]].
  
  === Updating ===
+ ==== Isn't there redundant/wasteful duplication between the Nutch crawldb and 
the Solr index? ====
+ Nutch maintains a crawldb (and a linkdb, for that matter) of the URLs it has 
crawled, their fetch status, and the fetch date. This data is kept beyond the 
fetch so that pages can be re-crawled after a re-crawling period. At the same 
time, Solr maintains an inverted index of all the fetched pages. Wouldn't it be 
more efficient for Nutch to rely on the index instead of maintaining its own 
crawldb, so that the same URL is not stored twice? The problem we face here is 
what Nutch would do if we wished to change the Solr core we index to.
+ 
+ What's described above could be done with Nutch 2.0 by adding a Solr backend 
to Gora. Solr would be used to store the webtable and, provided that you set up 
the schema accordingly, you could index the appropriate fields for searching. 
Further to this, Nutch is a crawler intended to write to more than one search 
engine, so its crawl state should not depend on any single index. Besides, the 
crawldb as a flat file is gone in trunk (2.0). Also, Solr is really slow when it 
comes to updating millions of records, whereas the crawldb is not when it is 
split over multiple machines.
+ 
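
As a rough illustration of why the crawler keeps its own per-URL state, the sketch below models a crawldb-like record (URL, fetch status, fetch date) and selects the URLs that are due for a re-crawl. It is not Nutch code: `CrawlRecord`, `due_for_refetch`, `select_refetch`, the status strings, and the 30-day interval are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


# Hypothetical stand-in for one crawldb entry; this is NOT Nutch's own class.
@dataclass
class CrawlRecord:
    url: str
    status: str           # assumed values: "fetched" or "unfetched"
    fetch_time: datetime  # when the page was last fetched


def due_for_refetch(record, now, interval=timedelta(days=30)):
    """Return True if the URL was never fetched or its re-crawl interval elapsed."""
    if record.status == "unfetched":
        return True
    return now - record.fetch_time >= interval


def select_refetch(records, now, interval=timedelta(days=30)):
    """Return the URLs that should go into the next fetch round, in input order."""
    return [r.url for r in records if due_for_refetch(r, now, interval)]
```

A decision like this only needs the URL, status, and date, which is far less than the full inverted index holds, and it can be made without querying any search engine.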
  === Indexing ===
  ==== Is it possible to change the list of common words without crawling 
everything again? ====
  Yes. The list of common words is used only when indexing and searching, and 
not during other steps. So, if you change the list of common words, there is no 
need to re-fetch the content; you just need to re-create the segment indexes to 
reflect the changes.
