Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The "FAQ" page has been changed by LewisJohnMcgibbney:
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=132&rev2=133

  ==== How do I index my local file system? ====
  The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you '''have''' to change config files to get it to crawl your local disk.

-  . 1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
+  . 1) regex-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites.
   . Change this line:
   -^(file|ftp|mailto|https):
   to this:
   -^(http|ftp|mailto|https):
-  2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
+  2) regex-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:
   . # accept anything else
   +.*
   3) By default the protocol-file plugin is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:
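
Editor's aside on steps 1) and 2), not the nutch-site.xml entry from step 3): the effect of the changed filter line can be checked with a small sketch. This is not Nutch code. It assumes the rules in regex-urlfilter.txt are tried top to bottom, that the first rule whose regex is found in the URL decides (+ accepts, - rejects), and that a URL matching no rule is rejected. Python's `re` stands in for Java regexes here, and the names `parse_rules`, `accepts` and `LOCAL_RULES` are made up for illustration.

```python
import re


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and '#' comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        # Store (accept?, compiled regex) in file order.
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Assumed semantics: the first rule found in the URL decides; no match rejects."""
    for accept, regex in rules:
        if regex.search(url):
            return accept
    return False


# Hypothetical rule set after the edit described in steps 1) and 2).
LOCAL_RULES = """
# skip http:, ftp:, mailto: and https: URLs, but let file: through
-^(http|ftp|mailto|https):
# accept anything else
+.*
"""

if __name__ == "__main__":
    rules = parse_rules(LOCAL_RULES)
    for url in ("file:///home/user/docs/a.txt", "http://example.com/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

Under these assumptions the unmodified line `-^(file|ftp|mailto|https):` rejects every file: URL before the catch-all rule is reached, which is why nothing on the local disk gets indexed until it is changed.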

