[Nutch Wiki] Update of "FAQ" by GodmarBack

Apache Wiki Wed, 06 Jan 2010 15:54:18 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by GodmarBack.
The comment on this change is: added useful link to Crawling the local 
filesystem page.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=112&rev2=113

--------------------------------------------------

  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that Mozilla will '''not''' load file: URLs from a web page 
fetched with http. So if you test with the Nutch web container running in 
Tomcat, nothing will happen when you click on results. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]], and 
the behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
  
- ==== Nutch crawling parent directories for file protocol ->  misconfigured 
URLFilters ====
+ ==== Nutch crawling parent directories for file protocol ====
+ 
+ If you find Nutch crawling parent directories when using the file protocol, 
the following kludge may help:
+ 
- [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex 
you should put the following in regex-urlfilter.txt :
+ See [[http://issues.apache.org/jira/browse/NUTCH-407]]. For example, for 
urlfilter-regex you could put the following in regex-urlfilter.txt:
  {{{
  +^file:///c:/top/directory/
  -.
  }}}
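
To see why these two rules keep the crawler out of parent directories, here is a minimal Python sketch of how such a rule list can be evaluated. It is not Nutch's own code. It assumes that rules are tried in order, that the first pattern found anywhere in the URL decides the outcome, and that a URL matching no rule is rejected. The names `RULES` and `accepts` are hypothetical.

```python
import re

# The two rules from the regex-urlfilter.txt example above, as (sign, pattern).
# "+" means accept, "-" means reject.
RULES = [
    ("+", r"^file:///c:/top/directory/"),
    ("-", r"."),
]

def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        # Assumption: a pattern matches if it is found anywhere in the URL.
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False

if __name__ == "__main__":
    for url in ("file:///c:/top/directory/docs/a.txt",
                "file:///c:/top/",
                "file:///c:/"):
        print(url, accepts(url))
```

Under these assumptions, anything below /top/directory/ hits the `+` rule first and is kept. The parent directories fail the first pattern, fall through to the catch-all `-.` rule and are dropped.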
+ 
+ Alternatively, you could apply the patch described 
[[http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch|on 
this page]], which would avoid the hardwiring of the site-specific 
/top/directory in your configuration file.
  
  ==== How do I index remote file shares? ====
