Hi John,

I think the first issue is that you don't have any kind of filter telling
Nutch to stick to only the sites you care about.  Have a look at the
regex-urlfilter.txt config file.
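
Something along these lines at the bottom of that file should keep the crawl
on just your sites -- example.com/example.org below are placeholders for your
actual domains, and depending on your Nutch version the one-step crawl
command may read crawl-urlfilter.txt instead:

    # accept URLs on the sites we care about
    +^http://([a-z0-9]*\.)*example.com/
    +^http://([a-z0-9]*\.)*example.org/

    # reject everything else (this replaces the default "+." catch-all)
    -.

With the "+." catch-all gone, outlinks pointing off those domains should get
filtered out instead of followed.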


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
> From: John Martyniak <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Wednesday, June 11, 2008 10:13:06 PM
> Subject: Deep Searching and whole web searches
> 
> Hi, I am new to Nutch, but I have played around with it.  So far I really
> like the tool.
> I would like to be able to deep-crawl a couple of sites and also spider-crawl
> other sites, so that the end result is an index with a large portion of
> several sites and a more organic spidering of the rest.
> 
> I have tried to do this in several ways.  I have used the crawl command and
> set the depth level, etc., which works: I get a valid index and results.
> 
> I have also injected the individual URLs of the starting sites into the
> crawldb and iterated through the generate/fetch/update sequence.  However, in
> this case it covers the whole-web index but doesn't seem to add any
> additional depth on the starting URLs, which is an issue.
> 
> When I try to merge the crawl results into the generate/fetch/update results,
> I get errors.
> 
> Is there any way to do this?  Also, is there any way to set a priority on
> certain sites, something like: these need to be updated daily and the rest
> weekly?
> 
> Thank you in advance for any help.
> 
> -John
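
For reference, the generate/fetch/update cycle described above typically looks
something like this (directory names are placeholders, and exact arguments
vary a bit between Nutch versions):

    # seed the crawldb with the starting URLs ("urls" holds the seed list)
    bin/nutch inject crawl/crawldb urls

    # one round; repeat these four lines to go roughly one level deeper
    bin/nutch generate crawl/crawldb crawl/segments -topN 1000
    segment=`ls -d crawl/segments/2* | tail -1`
    bin/nutch fetch $segment
    bin/nutch updatedb crawl/crawldb $segment

Each pass picks up the outlinks discovered in the previous round, so repeating
the cycle is what adds depth beyond the injected URLs, and with the "-." rule
in place it should stay on the listed sites.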
