hi,

i have identified one source of such a problem and opened an issue in jira:

  https://issues.apache.org/jira/browse/NUTCH-774

you can apply the patch attached there and check whether it solves your problem.

btw, you can also check your crawldb for such items - the retry interval is set to 0. just dump the crawldb and search for it.
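for example, a minimal sketch (it assumes a nutch 1.x source checkout, that the jira attachment was saved as NUTCH-774.patch, and that your crawldb lives under crawl/crawldb - the exact "Retry interval" wording in the dump can differ between nutch versions):

  # apply the patch in the nutch source root and rebuild
  cd $NUTCH_HOME
  patch -p0 < NUTCH-774.patch
  ant

  # dump the crawldb as plain text and look for entries whose
  # refetch/retry interval ended up as 0
  bin/nutch readdb crawl/crawldb -dump crawldb_dump
  grep -B 5 "Retry interval: 0" crawldb_dump/part-00000

the -B 5 just makes grep print the preceding lines too, so you can see which url the matching entry belongs to.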
regards
reinhard

BELLINI ADAM wrote:
> hi,
>
> i'm crawling my intranet, and i have set db.fetch.interval.default to
> 18000 seconds (5 hours), but it doesn't seem to work correctly:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>18000</value>
>   <description>The number of seconds between re-fetches of a page (5 hours).
>   </description>
> </property>
>
> the first crawl, when the crawl directory $crawl doesn't exist yet, takes
> only about 2 hours (with depth=10).
>
> but when performing the recrawl with the recrawl.sh script (with a full
> crawldb), it takes about 2 hours for each depth!
>
> and when i checked the log file i found that one URL is fetched several
> times! so is my 5-hour db.fetch.interval.default working correctly?
>
> why is it refetching the same URLs several times at each depth (depth=10)?
> i understood that the timestamp of a page would prevent a refetch as long
> as the interval (5 hours) has not elapsed yet.
>
> please can you explain how db.fetch.interval.default works? should i use
> only one depth, since i already have all the urls in the crawldb?
>
> i'm using this recrawl script:
>
> depth=10
>
> echo "----- Inject (Step 1 of $steps) -----"
> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for ((i=0; i < $depth; i++))
> do
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
>   # generate a fetch list; exits non-zero when there is nothing to fetch
>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
>     break
>   fi
>
>   # fetch the newest segment; delete it and continue if the fetch failed
>   segment=`ls -d $crawl/segments/* | tail -1`
>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>     echo "runbot: Deleting segment $segment."
>     rm $RMARGS $segment
>     continue
>   fi
>
>   echo "----- Updating Database (Step 2 of $steps) -----"
>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
> done
>
> echo "----- Merge Segments (Step 3 of $steps) -----"
> $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
> rm -rf $crawl/segments
> mv $crawl/MERGEDsegments $crawl/segments
>
> echo "----- Invert Links (Step 4 of $steps) -----"
> $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
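one more remark on the recrawl script itself: it references $steps, $threads, $topN and $RMARGS, but none of them are set in what you posted, so e.g. fetch is run with an empty -threads value. a header like this would make it self-contained (the values are just placeholders, adjust them to your setup):

  #!/bin/bash
  # assumed paths and tuning values - adjust to your installation
  NUTCH_HOME=/opt/nutch    # where nutch is installed (assumption)
  crawl=crawl              # crawl directory used throughout the script (assumption)
  depth=10                 # number of generate/fetch/update rounds
  steps=4                  # only used in the progress messages
  threads=10               # passed to 'nutch fetch -threads'
  topN="-topN 1000"        # caps each fetch list; leave empty to take everything
  RMARGS="-rf"             # rm options used when deleting a failed segment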

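and if you want to see what actually happened to one particular url (status, fetch time, retry interval), you can query the crawldb for a single entry - in nutch 1.x the readdb tool has a -url option for that (the url below is just an example):

  # prints the CrawlDatum (status, fetch time, retry interval, score) for one url
  bin/nutch readdb crawl/crawldb -url http://intranet.example.com/some/page.html

if the retry interval printed there is 0, you are hitting exactly the problem from NUTCH-774 above.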