Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "FAQ" page has been changed by GodmarBack.
The comment on this change is: added useful link to Crawling the local filesystem page.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=112&rev2=113

--------------------------------------------------

  Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that Mozilla will '''not''' load file: URLs from a web page fetched with http. So if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen when you click on results, because Mozilla by default does not load file URLs. This is mentioned [[http://www.mozilla.org/quality/networking/testing/filetests.html|here]], and this behavior may be disabled by a [[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] (see security.checkloaduri). IE5 does not have this problem.

- ==== Nutch crawling parent directories for file protocol -> misconfigured URLFilters ====
+ ==== Nutch crawling parent directories for file protocol ====
+
+ If you find nutch crawling parent directories when using the file protocol, the following kludge may help:
+
- [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex you should put the following in regex-urlfilter.txt :
+ [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex you could put the following in regex-urlfilter.txt :
  {{{
  +^file:///c:/top/directory/
  -.
  }}}
+
+ Alternatively, you could apply the patch described [[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on this page]], which would avoid the hardwiring of the site-specific /top/directory in your configuration file.

  ==== How do I index remote file shares? ====
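
To illustrate why the two regex-urlfilter.txt rules quoted in the change above keep the crawler out of parent directories, here is a minimal Python sketch. It is not Nutch code. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The helper names `parse_rules` and `accepts` are hypothetical.

```python
import re

# The two rules from the regex-urlfilter.txt fragment:
# a leading '+' accepts matching URLs, a leading '-' rejects them.
RULES_TEXT = """\
+^file:///c:/top/directory/
-.
"""


def parse_rules(text):
    """Turn rule lines into (accept, compiled_pattern) pairs, in file order."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule whose pattern is found decides."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: a URL matching no rule is rejected


RULES = parse_rules(RULES_TEXT)

for url in (
    "file:///c:/top/directory/docs/a.txt",  # inside the allowed tree
    "file:///c:/top/",                      # parent directory
    "http://example.com/",                  # unrelated URL
):
    print(url, accepts(RULES, url))
```

Under these assumptions, only URLs that start with file:///c:/top/directory/ pass the first rule. Everything else, including the parent directories, falls through to the catch-all `-.` rule and is dropped.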