Whole web crawling is about indexing the entire web, as opposed to a deep index of a single site.

The urls argument is the urlDir, a directory that should hold one or more text files listing the URLs to be fetched. The -dir parameter is the output directory for the crawl.
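For example, a minimal layout would look something like this (the names urls/ and seed.txt are just placeholders, and the crawl command matches the one you quoted):

  urls/
    seed.txt        <- one URL per line
  bin/nutch crawl urls -dir crawl.test -depth 3 -topN 2500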

Because you are crawling local files, either the initial text file in the urlDir needs to contain the URLs, or the documents you are crawling need to link to the other URLs so the crawler can discover them.
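If you want all 2,500 files reachable from the seed list itself, one option is to generate a file: URL for every document up front. This is just a sketch; /home/you/docs is a placeholder for wherever the documents actually live, and your Nutch configuration has to be able to fetch whatever protocol you use:

  find /home/you/docs -type f | sed 's|^|file://|' > urls/seed.txt

With every document listed as a seed, a depth of 1 is enough; otherwise the crawl only reaches documents that are linked from the seeds within the -depth you give it.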

Is that how you have it set up?

Dennis


Vincent155 wrote:
I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
"crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the
topN statement, there are still only some 50 to 75 files indexed.

I have searched for more information, but it is not clear to me if (for
example) "whole web crawling" is about a complete crawl of one site, or
about crawling and updating groups of sites. My question is about a complete
crawl (indexing) of one "site".
