Dear Wiki user, You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.
The following page has been changed by Gal Nitzan:
http://wiki.apache.org/nutch/FAQ

------------------------------------------------------------------------------
  ==== How can I recover an aborted fetch process? ====

- Well, you can not! However, you have two choices to proceed:
+ Well, you can not. '''However, you have two choices to proceed''':

- 1) Recover the pages already fetched and then restart the fetcher.
+ 1) Recover the pages already fetched and then restart the fetcher.

- You'll need to create a dummy file called fetcher.done in the segment directory, updatedb, generate and restart the fetcher.
+ You'll need to create a file '''fetcher.done''' in the segment directory and then: updatedb, generate and fetch.
  Assuming your index is at /index
  {{{
  % touch /index/segments/2005somesegment/fetcher.done

@@ -90, +90 @@

  All the pages that were not crawled will be re-generated for fetch. If you fetched lots of pages, and don't want to have to re-fetch them again, this is the best way.

- 2) Discard the aborted output.
+ 2) Discard the aborted output.

  Delete all folders from the segment folder except the fetchlist folder and restart the fetcher.

  ==== Who changes the next fetch date? ====
+ * After injecting a new URL the next fetch date is set to the current time.
  * Generating a fetchlist advances the date by 7 days.
  * Updating the db sets the date to the current time + db.default.fetch.interval - 7 days.

  ==== I have a big fetchlist in my segments folder. How can I fetch only some sites at a time? ====
+ * You have to decide how many pages you want to crawl before generating segments and use the options of bin/nutch generate.
  * Use -topN to limit the total number of pages.
  * Use -numFetchers to generate multiple small segments.
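Choice 2 above (discard the aborted output) can be sketched as a small shell function. This is a minimal sketch and not part of Nutch: the function name `discard_aborted_output` is hypothetical, and it assumes the segment contains a directory literally named `fetchlist`, as the answer describes. It refuses to do anything when no such directory is found.

```shell
#!/bin/sh
# Hypothetical helper, not shipped with Nutch: remove every sub-directory of a
# segment except "fetchlist", so the fetcher can be restarted on that segment.
discard_aborted_output() {
    segment="$1"
    # Safety check: only touch a directory that actually holds a fetchlist.
    if [ ! -d "$segment/fetchlist" ]; then
        echo "no fetchlist directory in '$segment', nothing done" >&2
        return 1
    fi
    for dir in "$segment"/*/; do
        [ -d "$dir" ] || continue               # the glob matched nothing
        name=$(basename "$dir")
        [ "$name" = "fetchlist" ] && continue   # keep the fetchlist
        rm -rf -- "$dir"
    done
}
```

With the example path from the FAQ entry, the call would be `discard_aborted_output /index/segments/2005somesegment`, followed by restarting the fetcher on that segment. Plain files in the segment directory are left alone, since the answer only mentions folders.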

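The two options from the last answer can be combined on one command line. The sketch below only prints the command instead of running it. `-topN` and `-numFetchers` come from the answer above; the function name `print_generate_cmd` is hypothetical, and the positional arguments (a db directory followed by a segments directory) are an assumption that should be checked against the usage message of your own `bin/nutch generate`.

```shell
#!/bin/sh
# Hypothetical helper: print (do not run) a generate command that limits the fetchlist.
# Arguments: db directory, segments directory, total page limit, number of segments.
print_generate_cmd() {
    db="$1"; segments="$2"; top_n="$3"; num_fetchers="$4"
    # Assumed argument order: <db> <segments>, then the options.
    echo "bin/nutch generate $db $segments -topN $top_n -numFetchers $num_fetchers"
}
```

For example, `print_generate_cmd /index/db /index/segments 1000 4` prints a command that would cap the run at 1000 pages split over 4 segments; the `/index/db` path is illustrative only.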