Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan: http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------

This is the official Nutch FAQ.

[[TableOfContents]]

== Nutch FAQ ==

=== General ===

==== Are there any mailing lists available? ====

There are user, developer, commits and agents lists, all available at http://lucene.apache.org/nutch/mailing_lists.html#Agents .

==== My system does not find the segments folder. Why? OR How do I tell the ''Nutch Servlet'' where the index files are located? ====

There are at least two ways to do that. In either case, first copy the .WAR file to the servlet container webapps folder:

 % cp nutch-0.7.war $CATALINA_HOME/webapps/ROOT.war

1) After building your first index, start Tomcat from the index folder. Assuming your index is located at /index/db/:

 % cd /index/db/
 % $CATALINA_HOME/bin/startup.sh

2) Point the webapp at the index explicitly. Start and stop Tomcat once, so that it extracts the contents of the ROOT.war file:

 % $CATALINA_HOME/bin/startup.sh
 % $CATALINA_HOME/bin/shutdown.sh

Then edit the nutch-default.xml located at $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/, look for the entry searcher.dir and replace its value with your index location /index/db.

=== Injecting ===

==== What happens if I inject urls several times? ====

Urls which are already in the database won't be injected.

=== Fetching ===

==== Is it possible to fetch only pages from some specific domains? ====

Please have a look at PrefixURLFilter. Adding some regular expressions to the urlfilter.regex.file might work, but adding a list with thousands of regular expressions would slow down your system excessively.

==== How can I recover an aborted fetch process? ====

You have two choices:

1) Use the aborted output. You'll need to touch the file fetcher.done in the segment directory; all the pages that were not crawled will be re-generated for fetch pretty soon. If you fetched lots of pages and don't want to re-fetch them again, this is the best way.
2) Discard the aborted output. To do this, just delete the fetcher* directories in the segment and restart the fetcher.
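For example, assuming the aborted segment lives at /index/segments/20051004120000 (the segment name here is only illustrative), the two choices look like this:

 # choice 1: keep the fetched pages and mark the fetch as finished
 % touch /index/segments/20051004120000/fetcher.done

 # choice 2: discard the partial output and fetch the segment again
 % rm -rf /index/segments/20051004120000/fetcher*
 % bin/nutch fetch /index/segments/20051004120000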
==== Who changes the next fetch date? ====

 * After injecting a new url the next fetch date is set to the current time.
 * Generating a fetchlist advances the date by 7 days.
 * Updating the db sets the date to the current time + db.default.fetch.interval - 7 days.

==== I have a big fetchlist in my segments folder. How can I fetch only some sites at a time? ====

 * You have to decide how many pages you want to crawl before generating segments, and use the options of bin/nutch generate.
 * Use -topN to limit the total number of pages.
 * Use -numFetchers to generate multiple small segments.
 * Afterwards you can generate new segments. You may want to use -adddays so that bin/nutch generate puts all the urls into the new fetchlist again; add more than 7 days if you did not run an updatedb.
 * Alternatively, send the fetcher process a unix STOP signal. You should be able to index the part of the segment that is already fetched, and later send a CONT signal to the process. Do not turn off your computer in between! :)

==== How can I force fetcher to use custom nutch-config? ====

 * Create a new sub-directory under $NUTCH_HOME/conf, like conf/myconfig
 * Copy these files from $NUTCH_HOME/conf to the new directory: common-terms.utf8, mime-types.*, nutch-conf.xsl, nutch-default.xml, regex-normalize.xml, regex-urlfilter.txt
 * Modify the nutch-default.xml to suit your needs
 * Set the NUTCH_CONF_DIR environment variable to point at your new directory
 * Run $NUTCH_HOME/bin/nutch so that it gets the NUTCH_CONF_DIR environment variable. Check the command output for the lines where the configs are loaded, to verify they are really loaded from your custom dir.
 * Happy using.

=== Updating ===

=== Indexing ===

==== Is it possible to change the list of common words without crawling everything again? ====

==== How do I index my local file system? ====

The tricky thing about Nutch is that out of the box it has most plugins disabled and is tuned for a crawl of a "remote" web server - you '''have''' to change config files to get it to crawl your local disk.

1) crawl-urlfilter.txt needs a change to allow file: URLs while not following http: ones, otherwise it either won't index anything, or it'll jump off your disk onto web sites. Change this line:

 -^(file|ftp|mailto|https):

to this:

 -^(http|ftp|mailto|https):

2) crawl-urlfilter.txt may have rules at the bottom to reject some URLs. If it has this fragment it's probably ok:

 # accept anything else
 +.*
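After steps 1 and 2, the bottom of conf/crawl-urlfilter.txt for a local-disk crawl should look something like this (the comment lines are only illustrative):

 # skip http:, ftp:, mailto: and https: urls, so the crawl stays on the local disk
 -^(http|ftp|mailto|https):

 # accept anything else
 +.*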
3) By default the [http://www.nutch.org/docs/api/net/nutch/protocol/file/package-summary.html "file plugin"] is disabled. nutch-site.xml needs to be modified to allow this plugin. Add an entry like this:

 <property>
   <name>plugin.includes</name>
   <value>protocol-file|protocol-http|parse-(text|html)|index-basic|query-(basic|site|url)</value>
 </property>

Now you can invoke the crawler and index all or part of your disk. The only remaining gotcha is that if you use Mozilla it will '''not''' load file: URLs from a web page fetched over http. So if you test with the Nutch web container running in Tomcat, annoyingly, nothing will happen as you click on results, because Mozilla by default does not load file: URLs. This is mentioned [http://www.mozilla.org/quality/networking/testing/filetests.html here], and the behavior can be disabled by a [http://www.mozilla.org/quality/networking/docs/netprefs.html preference] (see security.checkloaduri). IE5 does not have this problem.

==== While indexing documents, I get the following error: ====

''050529 011245 fetch okay, but can't parse myfile, reason: Content truncated at 65536 bytes. Parser can't handle incomplete msword file.''

'''What is happening?'''

By default, the size of the documents downloaded by Nutch is limited to 65536 bytes. To allow Nutch to download larger files (via HTTP), modify nutch-site.xml and add an entry like this:

 <property>
   <name>http.content.limit</name>
   <value>150000</value>
 </property>

If you do not want to limit the size of downloaded documents, set http.content.limit to a negative value.
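For example, a minimal nutch-site.xml entry that removes the limit entirely (any negative value will do):

 <property>
   <name>http.content.limit</name>
   <value>-1</value>
 </property>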
=== Segment Handling ===

==== Do I have to delete old segments after some time? ====

If you're fetching regularly, segments older than db.default.fetch.interval can be deleted, as their pages should have been refetched by then. This is 30 days by default.

=== Searching ===

==== Common words are saturating my search results. ====

You can tweak your conf/common-terms.utf8 file after creating an index through the following command: