I'm crawling a directory on my local Windows file system. However, Nutch crawls all of the top-level directories on my C: drive, not just the directory I told it to crawl. Is this a bug or expected behavior? If it is expected behavior, why?
I created a directory of sample documents at c:\nutch-0.8.1\input. This directory contains a single Word document and a subdirectory with additional Word and PDF documents. I also created a single URL file, which I pass to the crawler. It has the following entry:

    file:///C:/nutch-0.8.1/input

During the crawl, I notice 404 errors. For example:

    fetching file:/C:/nutch-0.8.1/DOC1.doc
    fetching file:/C:/nutch-0.8.1/fin/
    fetch of file:/C:/nutch-0.8.1/fin/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
    fetch of file:/C:/nutch-0.8.1/DOC1.doc failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
    Fetcher: done

Why is Nutch looking for "DOC1.doc" in C:\nutch-0.8.1? Where did it get the idea to look for that document in the wrong location? It should only look for it in c:\nutch-0.8.1\input (and it does eventually find it there).

After the crawl is finished, I can see top-level C: folders and documents listed as crawled. For example, in the crawldb dump, here is an entry that should not have been crawled:

    file:/C:/CONFIG.SYS Version: 4
    Status: 1 (DB_unfetched)
    Fetch time: Sun Oct 29 00:56:27 PDT 2006
    Modified time: Wed Dec 31 16:00:00 PST 1969
    Retries since fetch: 0
    Retry interval: 30.0 days
    Score: 0.004761905
    Signature: null
    Metadata: null

Why did C:\Config.sys get crawled when I specified the crawl directory as c:\nutch-0.8.1\input?

For what it's worth, I've set the following in my crawl-urlfilter.txt:

    +^file://*

Finally, someone posted a very helpful resource on crawling the local filesystem at http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch. In item #7, this person suggests changing the org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f) method and recompiling to get rid of this behavior.

Thank you,
John
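
For reference, here is a small sketch, written in Python and entirely outside Nutch, of two generic behaviors that might be relevant to the symptoms above. It assumes (not confirmed here) that the directory listing Nutch generates contains relative links, including one to `../`, and that a rule in crawl-urlfilter.txt accepts a URL when its regular expression is found anywhere in that URL. The `narrow` pattern and the `accepted` helper are hypothetical names used only for illustration.

```python
import re
from urllib.parse import urljoin

# Seed URL as written in the URL file (no trailing slash), and a variant with one.
seed_no_slash = "file:///C:/nutch-0.8.1/input"
seed_with_slash = "file:///C:/nutch-0.8.1/input/"

# Standard relative resolution replaces everything after the last "/" of the base.
# Without a trailing slash, "input" is treated like a file name and is dropped.
doc_from_no_slash = urljoin(seed_no_slash, "DOC1.doc")
doc_from_slash = urljoin(seed_with_slash, "DOC1.doc")

# A "../" link in a listing resolves to the parent directory.
parent_from_slash = urljoin(seed_with_slash, "../")

# The rule from crawl-urlfilter.txt, without its leading "+".
# "/*" means "zero or more slashes", so this matches any URL starting with "file:/".
broad = re.compile(r"^file://*")

# Hypothetical narrower rule (an assumption, not taken from the Nutch docs):
# accept only URLs below the input directory.
narrow = re.compile(r"^file:/+C:/nutch-0\.8\.1/input/")


def accepted(pattern, url):
    """Return True if the pattern is found in the URL (assumed filter semantics)."""
    return pattern.search(url) is not None


if __name__ == "__main__":
    print(doc_from_no_slash)
    print(doc_from_slash)
    print(parent_from_slash)
    for url in ("file:/C:/CONFIG.SYS", "file:/C:/nutch-0.8.1/input/DOC1.doc"):
        print(url, accepted(broad, url), accepted(narrow, url))
```

If those assumptions hold, the 404s would be consistent with the seed URL lacking a trailing slash, and the broad filter rule would let parent-directory links such as file:/C:/CONFIG.SYS into the crawldb. Note that the narrower pattern shown requires the trailing slash after `input`, so the seed entry would need one as well.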