[Nutch Wiki] Update of "FAQ" by Gal Nitzan

Apache Wiki Thu, 22 Sep 2005 23:53:52 -0700

The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  ==== What Java version is required to run Nutch? ====
  
  Nutch 0.7 will run with Java 1.4 and up.
+ 
+ ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ====
+ 
+ nutch-default.xml is the out-of-the-box configuration for Nutch. Most 
configuration can (and should, unless you know what you're doing) stay as it is.
+ nutch-site.xml is where you make the changes that override the default 
settings.
+ The same goes for the servlet container application.
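
As a minimal sketch of such an override (not taken from the wiki page itself), a nutch-site.xml might list only the properties being changed. The searcher.dir property name, the <nutch-conf> root element and the /index value are assumptions based on the 0.7-era nutch-default.xml; check them against the nutch-default.xml shipped with your release.

{{{
<?xml version="1.0"?>
<!-- nutch-site.xml: list only the properties that differ from nutch-default.xml -->
<nutch-conf>
  <property>
    <name>searcher.dir</name>
    <!-- assumed location of the crawl data; adjust to your setup -->
    <value>/index</value>
  </property>
</nutch-conf>
}}}

Any property not listed in this file keeps its value from nutch-default.xml.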
  
  ==== My system does not find the segments folder. Why? OR How do I tell the 
''Nutch Servlet'' where the index files are located? ====
  
@@ -53, +59 @@

  
  % $CATALINA_HOME/bin/startup.sh}}}
  
- ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ====
- 
- nutch-default.xml is the out-of-the-box configuration for Nutch. Most 
configuration can (and should, unless you know what you're doing) stay as it is.
- nutch-site.xml is where you make the changes that override the default 
settings.
- The same goes for the servlet container application.
- 
  === Injecting ===
  
  ==== What happens if I inject urls several times? ====
@@ -76, +76 @@

  
  Well, you cannot! However, you have two choices to proceed:
     1) Recover the pages already fetched and then restart the fetcher.
-       * You'll need to create a dummy file called fetcher.done in the segment 
directory. % touch index/yoursegdir/fetcher.done . All the pages that were not 
crawled will be re-generated for fetch pretty soon. If you fetched lots of 
pages, and don't want to have to re-fetch them again, this is the best way.
+       * You'll need to create a dummy file called fetcher.done in the segment 
directory, then run updatedb and generate, and restart the fetcher.
+         Assuming your index is at /index:
+         {{{ % touch /index/segments/2005somesegment/fetcher.done
+ % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
+ % bin/nutch generate /index/db/ /index/segments/2005somesegment/
+ % bin/nutch fetch /index/segments/2005somesegment}}}
+ 
+ All the pages that were not crawled will be re-generated for fetch. If you 
fetched lots of pages and don't want to fetch them again, this is 
the best way.
+ 
     2) Discard the aborted output.
        * Delete all folders from the segment folder except the fetchlist 
folder and restart the fetcher.
