Whole web crawling is about indexing the entire web, as opposed to a deep index of a single site.

The urls argument is the urlDir, a directory that should hold one or more text files listing the URLs to be fetched. The -dir parameter is the output directory for the crawl.
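For example, a minimal layout would look something like this (the names urls/ and seed.txt are just placeholders, and the crawl command matches the one you quoted):

  urls/
    seed.txt        <- one URL per line
  bin/nutch crawl urls -dir crawl.test -depth 3 -topN 2500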

Because you are crawling local files, either the initial text file in the urlDir needs to contain the URLs, or the documents you are crawling need to link to the other URLs so the crawler can discover them.
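If you want all 2,500 files reachable from the seed list itself, one option is to generate a file: URL for every document up front. This is just a sketch; /home/you/docs is a placeholder for wherever the documents actually live, and your Nutch configuration has to be able to fetch whatever protocol you use:

  find /home/you/docs -type f | sed 's|^|file://|' > urls/seed.txt

With every document listed as a seed, a depth of 1 is enough; otherwise the crawl only reaches documents that are linked from the seeds within the -depth you give it.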

Is that how you have it set up?

Dennis


Vincent155 wrote:
I have a virtual machine running (VMware 1.0.7). Both host and guest run on
Fedora 10. In the virtual machine, I have Nutch installed. I can index
directories on my host as if they are websites.

Now I want to compare Nutch with another search engine. For that, I want to
index some 2,500 files in a directory. But when I execute a command like
"crawl urls -dir crawl.test -depth 3 -topN 2500", or leave out the
topN statement, there are still only some 50 to 75 files indexed.

I have searched for more information, but it is not clear to me if (for
example) "whole web crawling" is about a complete crawl of one site, or
about crawling and updating groups of sites. My question is about a complete
crawl (indexing) of one "site".
