[Nutch Wiki] Update of "FAQ" by GodmarBack

Apache Wiki Wed, 06 Jan 2010 16:46:14 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.


The "FAQ" page has been changed by GodmarBack.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=113&rev2=114

--------------------------------------------------

  {{{
      <property>
        <name>plugin.includes</name>
-       <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
+       <value>protocol-file|...copy original values from nutch-default here...</value>
      </property>
  }}}
+ 
+ where the placeholder stands for the complete plugin.includes value copied
from nutch-default.xml. This ensures that every plugin that is normally
enabled stays enabled, with the protocol-file plugin added. Make sure to
include parse-pdf if you want to parse PDF files. Make sure that
urlfilter-regexp is included, or else '''the *urlfilter files will be
ignored''' and Nutch will accept all URLs. You need to enable crawl URL
filters to prevent Nutch from crawling up into the parent directory; see below.
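
The merge described above can be sketched in a few lines of Python. This is only an illustration, not part of Nutch: the helper names (read_plugin_includes, add_plugin, missing_plugins) are hypothetical, SAMPLE_DEFAULT_XML is a stand-in built from the old value shown in the diff rather than the real nutch-default.xml, and the check assumes that plugin.includes is applied as a regular expression that must match a whole plugin id.

```python
import re
import xml.etree.ElementTree as ET

# Stand-in for nutch-default.xml (NOT the real file): the old value from the
# diff above, minus protocol-file.
SAMPLE_DEFAULT_XML = """
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
  </property>
</configuration>
"""


def read_plugin_includes(xml_text):
    """Return the plugin.includes value from a Nutch-style configuration."""
    root = ET.fromstring(xml_text)
    for prop in root.iter("property"):
        if prop.findtext("name", "").strip() == "plugin.includes":
            return prop.findtext("value", "").strip()
    raise KeyError("plugin.includes not found")


def add_plugin(includes, plugin_id):
    """Prepend plugin_id unless the pattern already matches it."""
    # Assumption: the value is a regex matched against the whole plugin id.
    if re.fullmatch(includes, plugin_id):
        return includes
    return plugin_id + "|" + includes


def missing_plugins(includes, required):
    """List the required plugin ids that the pattern does not match."""
    return [p for p in required if not re.fullmatch(includes, p)]
```

Calling add_plugin(read_plugin_includes(text), "protocol-file") on the contents of your own nutch-default.xml gives a candidate value for nutch-site.xml, and missing_plugins(value, ["urlfilter-regexp", "parse-pdf"]) (ids spelled as in the text above) flags the plugins the paragraph warns about before you start a crawl.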
  
  Now you can invoke the crawler and index all or part of your disk. The only
remaining gotcha is that Mozilla will '''not''' load file: URLs from a web
page fetched over http. If you test with the Nutch web container running in
Tomcat, nothing will happen when you click on results. This is mentioned
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]], and
the behavior can be disabled by a
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]]
(see security.checkloaduri). IE5 does not have this problem.
