Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "FAQ" page has been changed by GodmarBack.
The comment on this change is: Corrected formatting - the {{{ must be in the 
first column, apparently.
http://wiki.apache.org/nutch/FAQ?action=diff&rev1=111&rev2=112

--------------------------------------------------

  There are at least two choices to do that:
  
    First you need to copy the .WAR file to the servlet container webapps 
folder.
+ {{{
-      {{{% cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
+    % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war
  }}}
  
    1) After building your first index, start Tomcat from the index folder.
      Assuming your index is located at /index :
+ {{{
-     {{{% cd /index/
+ % cd /index/
- % $CATATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
      '''Now you can search.'''
  
    2) After building your first index, start and stop Tomcat, which will make 
Tomcat extract the Nutch webapp. Then you need to edit nutch-site.xml and 
put in it the location of the index folder.
+ {{{
-     {{{% $CATATALINA_HOME/bin/startup.sh
+ % $CATALINA_HOME/bin/startup.sh
- % $CATATALINA_HOME/bin/shutdown.sh}}}
+ % $CATALINA_HOME/bin/shutdown.sh
+ }}}
  
+ {{{
-     {{{% vi $CATATALINA_HOME/bin/webapps/ROOT/WEB-INF/classes/nutch-site.xml
+ % vi $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/nutch-site.xml
  <?xml version="1.0"?>
  <?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
  
@@ -85, +91 @@

  
  </nutch-conf>
  
- % $CATATALINA_HOME/bin/startup.sh}}}
+ % $CATALINA_HOME/bin/startup.sh
+ }}}
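
  The property that actually points the webapp at the index falls outside the 
diff hunk above. As a sketch only — assuming Nutch 0.7, where the index 
location is (to my knowledge) given by the `searcher.dir` property, and the 
example index path /index used above — the entry would look like:

```xml
<!-- Sketch: nutch-site.xml entry pointing the webapp at the index.
     "searcher.dir" is assumed to be the Nutch 0.7 property name;
     /index is the example path used above. -->
<property>
  <name>searcher.dir</name>
  <value>/index</value>
</property>
```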
  
  === Injecting ===
  
@@ -110, +117 @@

  
      You'll need to create a file fetcher.done in the segment directory and 
then: [[http://wiki.apache.org/nutch/bin/nutch_updatedb|updatedb]], 
[[http://wiki.apache.org/nutch/bin/nutch_generate|generate]] and 
[[http://wiki.apache.org/nutch/bin/nutch_fetch|fetch]] .
        Assuming your index is at /index
+ {{{ 
-       {{{ % touch /index/segments/2005somesegment/fetcher.done
+ % touch /index/segments/2005somesegment/fetcher.done
  
  % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
  
  % bin/nutch generate /index/db/ /index/segments/2005somesegment/
  
- % bin/nutch fetch /index/segments/2005somesegment}}}
+ % bin/nutch fetch /index/segments/2005somesegment
+ }}}
  
        All the pages that were not crawled will be re-generated for fetch. If 
you fetched lots of pages, and don't want to have to re-fetch them again, this 
is the best way.
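
      If several segments were interrupted, the same fetcher.done trick 
applies to each of them. A minimal sketch, assuming bash and an illustrative 
index root under /tmp/demo-index (the real path would be your /index):

```shell
# Sketch: mark every segment under the index as "fetched" so that
# updatedb will accept it. The index root below is illustrative.
INDEX=/tmp/demo-index
mkdir -p "$INDEX/segments/2005somesegment"   # stand-in segment directory
for seg in "$INDEX"/segments/*/; do
  touch "${seg}fetcher.done"
done
ls "$INDEX/segments/2005somesegment"         # shows fetcher.done
# afterwards, run updatedb/generate/fetch as shown above
```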
  
@@ -146, +155 @@

  
  If you have a fast internet connection (> 10Mb/sec) your bottleneck will 
definitely be in the machine itself (in fact you will need multiple machines to 
saturate the data pipe).  Empirically I have found that the machine works well 
up to about 1000-1500 threads.  
  
- To get this to work on my Linux box I needed to set the ulimit to 65535 
(ulimit -n 65535), and I had to make sure that the DNS server could handle the 
load (we had to speak with our colo to get them to shut off an artifical cap on 
the DNS servers).  Also, in order to get the speed up to a reasonable value, we 
needed to set the maximum fetches per host to 100 (otherwise we get a quick 
start followed by a very long slow tail of fetching).
+ To get this to work on my Linux box I needed to set the ulimit to 65535 
(ulimit -n 65535), and I had to make sure that the DNS server could handle the 
load (we had to speak with our colo to get them to shut off an artificial cap 
on the DNS servers).  Also, in order to get the speed up to a reasonable value, 
we needed to set the maximum fetches per host to 100 (otherwise we get a quick 
start followed by a very long slow tail of fetching).
  
  To other users: please add to this with your own experiences, my own 
experience may be atypical.
  
@@ -208, +217 @@

      +.*
  
    3) By default the 
[[http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html|"file
 plugin"]] is disabled. nutch-site.xml needs to be modified to allow this 
plugin. Add an entry like this:
- 
+ {{{
      <property>
        <name>plugin.includes</name>
        
<value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
      </property>
+ }}}
  
  Now you can invoke the crawler and index all or part of your disk. The only 
remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs 
from a web page fetched over http, so if you test with the Nutch web container 
running in Tomcat, clicking on results will, annoyingly, do nothing because 
Mozilla by default does not load file: URLs. This is mentioned 
[[http://www.mozilla.org/quality/networking/testing/filetests.html|here]] and 
this behavior may be disabled by a 
[[http://www.mozilla.org/quality/networking/docs/netprefs.html|preference]] 
(see security.checkloaduri). IE5 does not have this problem.
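
  To make "invoke the crawler and index all or part of your disk" concrete: a 
sketch, assuming a Nutch 0.7 checkout as the working directory; the seed path 
and file: URL below are illustrative, and the crawl command itself is left 
commented out because it needs a full Nutch install.

```shell
# Sketch: seed a disk crawl with a file: URL. The URL below is the
# illustrative one from the URLFilter section; adjust it to your disk.
mkdir -p /tmp/nutch-demo/seeds
echo 'file:///c:/top/directory/' > /tmp/nutch-demo/seeds/urls
cat /tmp/nutch-demo/seeds/urls
# bin/nutch crawl /tmp/nutch-demo/seeds -depth 3   # needs a Nutch install
```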
  
  ==== Nutch crawling parent directories for file protocol ->  misconfigured 
URLFilters ====
  [[http://issues.apache.org/jira/browse/NUTCH-407]] E.g. for urlfilter-regex 
you should put the following in regex-urlfilter.txt :
  {{{
- 
  +^file:///c:/top/directory/
  -.
  }}}
@@ -243, +252 @@

  '''What is happening?'''
  
    By default, the size of the documents downloaded by Nutch is limited (to 
65536 bytes). To allow Nutch to download larger files (via HTTP), modify 
nutch-site.xml and add an entry like this:
+ {{{
      <property>
        <name>http.content.limit</name>
      <value>150000</value>
      </property>
- 
+ }}}
    If you do not want to limit the size of downloaded documents, set 
http.content.limit to a negative value:
+ {{{
      <property>
        <name>http.content.limit</name>
      <value>-1</value>
      </property>
+ }}}
  
  === Segment Handling ===
  
@@ -282, +294 @@

      <description>The host and port that the MapReduce job tracker runs at. If 
"local", then jobs are run in-process as a single map and reduce 
task.</description>
    </property>
  
- 
    edit conf/mapred-default.xml
    <property>
      <name>mapred.map.tasks</name>
      <value>4</value>
      <description>define mapred.map.tasks to be multiple of number of slave 
hosts
- </description>
+     </description>
    </property>
  
    <property>
@@ -298, +309 @@

    </property>
  
    create a file with slave host names
- 
-   {{{
+ {{{
    % echo localhost >> ~/.slaves
  % echo somemachine >> ~/.slaves}}}
  
    start all ndfs & mapred daemons
-   {{{
+ {{{
    % bin/start-all.sh
    }}}
  
    create a directory with seed list file
-   {{{
+ {{{
    % mkdir seeds
    % echo http://www.cnn.com/ > seeds/urls
    }}}
  
-   copt the seed directory to ndfs
+   copy the seed directory to ndfs
-   {{{
+ {{{
    % bin/nutch ndfs -put seeds seeds
    }}}
  
    crawl a bit
-   {{{
+ {{{
    % bin/nutch crawl seeds -depth 3
    }}}
  
@@ -336, +346 @@

  ==== How to send commands to NDFS? ====
  
    list files in the root of NDFS
-   {{{
+ {{{
    [r...@xxxxxx mapred]# bin/nutch ndfs -ls /
    050927 160948 parsing file:/mapred/conf/nutch-default.xml
    050927 160948 parsing file:/mapred/conf/nutch-site.xml
