Hi John,

I think the first issue is that you don't have any kind of filter telling Nutch to stick only to the sites you care about. Have a look at the regex-urlfilter.txt config file.
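For example, a minimal regex-urlfilter.txt that keeps the crawl on a couple of sites could look something like this (example.com and example.org are just placeholders for your own domains; the final "-." line drops everything that no "+" rule matched):

  # accept URLs from the sites we care about
  +^http://([a-z0-9]*\.)*example\.com/
  +^http://([a-z0-9]*\.)*example\.org/
  # skip everything else
  -.

With a filter like that in place, the whole-web generate/fetch/update loop should stay on your chosen sites instead of wandering off.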
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
> From: John Martyniak <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 11, 2008 10:13:06 PM
> Subject: Deep Searching and whole web searches
>
> Hi, I am new to Nutch but have played around with it, and so far I really
> like the tool. I would like to deep crawl a couple of sites and also spider
> crawl other sites, so that the end result is an index containing a large
> portion of several sites plus a more organic spider of the rest.
>
> I have tried to do this in several ways. I have used the crawl command and
> set the depth level etc., which works: I get a valid index and results.
>
> I have also injected the individual URLs of the starting sites into the
> crawldb and iterated through the generate/fetch/update sequence. In this
> case it covers the whole web index, but it doesn't seem to add any
> additional depth on the starting URLs, which is an issue.
>
> When I have tried to merge the crawl results into the generate/fetch/update
> results, I get errors.
>
> Is there any way to do this? Also, is there any way to set a priority on
> certain sites, something like "these need to be updated daily and the rest
> weekly"?
>
> Thank you in advance for any help.
>
> -John
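P.S. For reference, the inject/generate/fetch/updatedb sequence described above usually looks roughly like the sketch below (the crawl/ and urls/ directory names and the iteration count are only placeholders). Each pass through the loop fetches the URLs generated from the crawldb and then folds their outlinks back in, so every additional iteration should go roughly one level deeper than the injected seeds:

  # inject the seed URLs into a fresh crawldb
  bin/nutch inject crawl/crawldb urls

  # each generate/fetch/updatedb pass goes about one link-level deeper
  for i in 1 2 3
  do
    bin/nutch generate crawl/crawldb crawl/segments
    segment=`ls -d crawl/segments/* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment
  done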
