I can't really use regex-urlfilter.txt to restrict the crawl, because I still want to fetch and index other sites. I would just like a couple of sites to be crawled deeply. I had assumed that the generate/fetch/update iteration would keep going deeper and deeper on every site in the index, so that it would eventually capture everything.
But that doesn't seem to be the case: the first injected URL has only one page in the index, while other sites have multiple pages.

Any suggestions would be greatly appreciated.

-John

On Wed, Jun 11, 2008 at 11:56 PM, <[EMAIL PROTECTED]> wrote:

> Hi John,
>
> I think the first issue is that you don't have any kind of filter that
> tells Nutch to stick only to the sites you care about. Have a look at
> the regex-urlfilter.txt config file.
>
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
> ----- Original Message ----
> > From: John Martyniak <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Wednesday, June 11, 2008 10:13:06 PM
> > Subject: Deep Searching and whole web searches
> >
> > Hi, I am new to Nutch, but I have played around with it. So far I
> > really like the tool.
> >
> > I would like to be able to deep crawl a couple of sites, and then
> > also spider crawl other sites, so that the end result is an index
> > with a large portion of several sites and more of an organic spider
> > of the rest.
> >
> > I have tried to do this in several ways. I have used the crawl
> > command and set the depth level etc., which works: I get a valid
> > index and results.
> >
> > I have also injected the individual URLs of the starting sites into
> > the crawldb and iterated through the generate/fetch/update sequence.
> > However, in this case it covers the whole web index, but it doesn't
> > seem to add any additional depth on the starting URLs, which is an
> > issue.
> >
> > When I have tried to merge the crawl results into the
> > generate/fetch/update results, I get errors.
> >
> > Is there any way to do this? Also, is there any way to set a
> > priority on certain sites, something like "these need to be updated
> > daily and the rest of these weekly"?
> >
> > Thank you in advance for any help.
> >
> > -John

--
John Martyniak
Before Dawn Solutions, Inc.
9457 S. University Blvd. #266
Highlands Ranch, CO 80126
o: 1-877-499-1562 x707 (Toll Free)
c: 303-522-1756
e: [EMAIL PROTECTED]
w: http://www.beforedawn.com
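
For readers following the thread, the generate/fetch/update iteration discussed above can be sketched as a dry-run plan. This is only an illustration under stated assumptions: the paths crawl/crawldb and crawl/segments, the helper names crawl_round and plan, and the segment placeholders are hypothetical and do not come from the thread. The script prints the commands instead of running them, because the real segment directory name is only known after generate has created it.

```python
from typing import List

NUTCH = "bin/nutch"  # assumed location of the Nutch launcher script


def crawl_round(crawldb: str, segments_dir: str, segment: str) -> List[List[str]]:
    """Return the three commands that make up one generate/fetch/update round."""
    return [
        # Select URLs from the crawldb and write a new segment under segments_dir.
        [NUTCH, "generate", crawldb, segments_dir],
        # Fetch the pages listed in that segment.
        [NUTCH, "fetch", segment],
        # Feed the fetch results (including newly found links) back into the crawldb.
        [NUTCH, "updatedb", crawldb, segment],
    ]


def plan(crawldb: str, segments_dir: str, rounds: int) -> List[List[str]]:
    """Build a dry-run command list for the given number of rounds."""
    commands: List[List[str]] = []
    for i in range(1, rounds + 1):
        # Placeholder only: generate names the real segment directory itself.
        segment = f"{segments_dir}/<segment-{i}>"
        commands.extend(crawl_round(crawldb, segments_dir, segment))
    return commands


if __name__ == "__main__":
    for cmd in plan("crawl/crawldb", "crawl/segments", 3):
        print(" ".join(cmd))
```

Each round can only fetch URLs that the crawldb already knows about, so a site gets at most one link level deeper per round, and only if its URLs are among those selected by generate.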
