Hi, I'm new to the world of Nutch. I am trying to crawl local file systems on a LAN using Nutch 1.0 and then search them with Solr. Documents are rarely modified, but they are frequently added and deleted, so I recrawl once a day. I have a few questions about recrawling.
1. What is the major difference between the bin/nutch crawl command and the recrawl script given in the wiki? Is it just that the script merges the segments? I am mostly curious about the performance implications.
2. Is there any way to tell the Solr index to delete a particular document when that resource no longer exists after a recrawl? I don't want to create a new Solr index every time I crawl; I want to update the existing one.
3. As documents are rarely modified, I want them to be fetched only when they have actually changed. But once interval.default is exceeded, a document is fetched without taking into consideration whether it has been modified or not. Is there any way to fetch only those documents that are newly added or have been modified?

Thanks a lot.

--
Regards,
Arpit Khurdiya
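[On question 3, a hedged config sketch: if your Nutch build includes the pluggable fetch-schedule support, org.apache.nutch.crawl.AdaptiveFetchSchedule shrinks or grows each page's fetch interval depending on whether the page changed since the last fetch, which is closer to "fetch only what was modified" than the fixed default interval. Property names below follow nutch-default.xml; verify them against your version before relying on this.]

```xml
<!-- nutch-site.xml: assumed property names, check against your nutch-default.xml -->
<property>
  <name>db.fetch.schedule.class</name>
  <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  <description>Adapt each page's fetch interval based on whether it
  changed since the last fetch.</description>
</property>
<property>
  <name>db.fetch.interval.default</name>
  <value>86400</value>
  <description>Default fetch interval in seconds (1 day, matching the
  daily recrawl); the adaptive schedule adjusts it per page.</description>
</property>
```

Newly added files are discovered on the next recrawl regardless of this setting, since they have never been fetched before.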
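[A sketch for question 2, not a confirmed recipe: Solr's XML update handler accepts delete-by-id and delete-by-query messages, so one approach is to diff the URLs seen in the previous and current crawls and post a delete message for the vanished ones, followed by a commit. The helper names and the http://localhost:8983/solr/update URL below are assumptions about a default Solr setup.]

```python
import urllib.request
from xml.sax.saxutils import escape

def build_solr_delete(doc_ids):
    """Build a Solr XML delete message for the given document ids (URLs)."""
    body = "".join("<id>%s</id>" % escape(doc_id) for doc_id in doc_ids)
    return "<delete>%s</delete>" % body

def post_update(solr_url, xml_body):
    """POST an XML update message to Solr's update handler (assumed URL)."""
    req = urllib.request.Request(
        solr_url,
        data=xml_body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    return urllib.request.urlopen(req)

# Hypothetical example: ids of files removed from the share since the last crawl
removed = ["file:///share/docs/old-report.pdf"]
msg = build_solr_delete(removed)
print(msg)
# To apply it, you would post the message and then a commit:
# post_update("http://localhost:8983/solr/update", msg)
# post_update("http://localhost:8983/solr/update", "<commit/>")
```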