On local file system crawl, why does nutch crawl parent directories?

John George Sun, 29 Oct 2006 16:50:29 -0800

I'm crawling a directory on my local Windows file
system. However, Nutch crawls all of the top-level
directories on my C: drive, not just the directory I
told it to crawl. Is this a bug or expected behavior?
If it is expected behavior, why?


I created a directory of sample documents at
c:\nutch-0.8.1\input. This directory contains a single
Word document and a subdirectory with additional Word
and PDF documents. I also created a single URL file,
which I pass to the crawler. It has the following
entry: file:///C:/nutch-0.8.1/input

During the crawl, I notice 404 errors. For example:

        fetching file:/C:/nutch-0.8.1/DOC1.doc
        fetching file:/C:/nutch-0.8.1/fin/
        fetch of file:/C:/nutch-0.8.1/fin/ failed with: org.apache.nutch.protocol.file.FileError: File Error: 404
        fetch of file:/C:/nutch-0.8.1/DOC1.doc failed with: org.apache.nutch.protocol.file.FileError: File Error: 404

        
Fetcher: done


Why is Nutch looking for "DOC1.doc" in C:\nutch-0.8.1?
Where did it get the idea to look for that document in
the wrong location? It should only look for it in
c:\nutch-0.8.1\input (and it does eventually find it
there).
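
One possible explanation can be sketched with ordinary URL resolution. The
sketch assumes (the output above does not confirm this) that Nutch renders a
directory as a page of relative links such as DOC1.doc, fin/ and ../, and
resolves them against the seed URL exactly as written. Python's
urllib.parse.urljoin stands in for Nutch's own resolver, and the helper name
resolve is hypothetical.

```python
from urllib.parse import urljoin


def resolve(base, href):
    """Resolve a relative link against a base URL using standard URL rules."""
    return urljoin(base, href)


# Seed exactly as in the URL file: no trailing slash, so "input" is treated
# like a file name and relative links are resolved against its parent.
seed = "file:///C:/nutch-0.8.1/input"
# Hypothetical variant with a trailing slash: "input/" is treated as a directory.
seed_with_slash = "file:///C:/nutch-0.8.1/input/"

for base in (seed, seed_with_slash):
    for href in ("DOC1.doc", "fin/", "../"):
        print(base, "+", href, "->", resolve(base, href))
```

Under these assumptions, the seed without a trailing slash places DOC1.doc
and fin/ directly under C:/nutch-0.8.1/, which would be consistent with the
404s above, and every ../ link climbs one directory higher.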

After the crawl is finished, I can see that top-level
folders and documents on C: were picked up by the crawl.
For example, the crawldb dump contains this entry, which
should not be there:

        file:/C:/CONFIG.SYS     Version: 4
        Status: 1 (DB_unfetched)
        Fetch time: Sun Oct 29 00:56:27 PDT 2006
        Modified time: Wed Dec 31 16:00:00 PST 1969
        Retries since fetch: 0
        Retry interval: 30.0 days
        Score: 0.004761905
        Signature: null
        Metadata: null

Why did C:\Config.sys get crawled when I specified the
crawl directory as c:\nutch-0.8.1\input?

For what it's worth, I've set the following in my
crawl-urlfilter.txt: 

+^file://*
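
To see what that line lets through, here is a minimal sketch. It assumes the
filter reads rules top to bottom, lets the first matching rule decide,
applies each regex as an unanchored search, and rejects a URL that matches no
rule. Python's re module stands in for Java's regex engine, which treats
these simple patterns the same way. The function accepts and the tighter rule
are hypothetical.

```python
import re


def accepts(rules, url):
    """Return True if the first rule whose regex is found in url starts with '+'."""
    for rule in rules:
        sign, pattern = rule[0], rule[1:]
        if re.search(pattern, url):
            return sign == "+"
    return False  # assumption: a URL that matches no rule is rejected


# The rule from crawl-urlfilter.txt: "file:/" followed by zero or more "/".
current = ["+^file://*"]
# Hypothetical tighter rule limited to the input directory.
tighter = [r"+^file:/+C:/nutch-0\.8\.1/input"]

for url in ("file:/C:/nutch-0.8.1/input/DOC1.doc", "file:/C:/CONFIG.SYS"):
    print(url, accepts(current, url), accepts(tighter, url))
```

Since every URL in the output above begins with file:/, the current rule
accepts all of them, including file:/C:/CONFIG.SYS.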


Finally, someone posted a very helpful resource on
crawling the local filesystem at
http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch
(see item #7). The author suggests changing the
org.apache.nutch.protocol.file.FileResponse.getDirAsHttpResponse(java.io.File f)
method and recompiling to get rid of this behavior.


Thank you,
John 


 