Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  ==== What Java version is required to run Nutch? ====
  Nutch 0.7 will run with Java 1.4 and up.
+ 
+ ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ====
+ 
+ nutch-default.xml is the out-of-the-box configuration for Nutch. Most of the configuration can (and should, unless you know what you're doing) stay as it is.
+ nutch-site.xml is where you make the changes that override the default settings.
+ The same goes for the servlet container application.
  
  ==== My system does not find the segments folder. Why? OR How do I tell the ''Nutch Servlet'' where the index files are located? ====
@@ -53, +59 @@
  % $CATALINA_HOME/bin/startup.sh}}}
  
- ==== I have two XML files, nutch-default.xml and nutch-site.xml, why? ====
- 
- nutch-default.xml is the out-of-the-box configuration for Nutch. Most of the configuration can (and should, unless you know what you're doing) stay as it is.
- nutch-site.xml is where you make the changes that override the default settings.
- The same goes for the servlet container application.
- 
  === Injecting ===
  ==== What happens if I inject urls several times? ====
@@ -76, +76 @@
  Well, you cannot! However, you have two choices to proceed:
  1) Recover the pages already fetched and then restart the fetcher.
-  * You'll need to create a dummy file called fetcher.done in the segment directory. % touch index/yoursegdir/fetcher.done . All the pages that were not crawled will be re-generated for fetch pretty soon. If you fetched lots of pages, and don't want to have to re-fetch them again, this is the best way.
+  * You'll need to create a dummy file called fetcher.done in the segment directory, then run updatedb and generate, and restart the fetcher.
+  Assuming your index is at /index
+  {{{ % touch /index/segments/2005somesegment/fetcher.done
+ % bin/nutch updatedb /index/db/ /index/segments/2005somesegment/
+ % bin/nutch generate /index/db/ /index/segments/2005somesegment/
+ % bin/nutch fetch /index/segments/2005somesegment}}}
+ 
+ All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages and don't want to re-fetch them, this is the best way.
+ 
  2) Discard the aborted output.
   * Delete all folders from the segment folder except the fetchlist folder and restart the fetcher.
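
Choice 2 can be sketched as a small shell helper. This is only a minimal sketch, not part of Nutch: the function name discard_aborted_output is hypothetical, and it assumes the segment directory holds one sub-folder per output plus the fetchlist folder, as the FAQ entry describes.

```shell
# Hypothetical helper (not shipped with Nutch): delete every sub-folder of a
# segment directory except "fetchlist", so the fetcher can be started again.
discard_aborted_output() {
    segment_dir=$1
    for entry in "$segment_dir"/*/; do
        # If the glob matched nothing, the pattern stays literal; skip it.
        if [ ! -d "$entry" ]; then
            continue
        fi
        name=$(basename "$entry")
        # Keep the fetchlist folder; it is what the fetcher reads on restart.
        if [ "$name" = "fetchlist" ]; then
            continue
        fi
        rm -r "$entry"
    done
}
```

With the same assumed layout as in choice 1, one would call discard_aborted_output /index/segments/2005somesegment and then restart the fetcher with bin/nutch fetch /index/segments/2005somesegment. Plain files and hidden folders in the segment directory are left untouched by this sketch.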
