The crawl datum here has state db_unfetched; it has not been fetched yet. Are you sure you don't have crawl datums with a retry interval of 0 seconds?

grep Retry crawldump | grep -v "18000"
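In case the dump step itself is unclear, a minimal sketch using the stock readdb tool (the output directory name is illustrative; -dump writes Hadoop part files into a directory, hence the recursive grep):

    # dump the whole crawldb as text (output is a directory of part files)
    bin/nutch readdb $crawl/crawldb -dump crawldump

    # flag any entry whose retry interval differs from the configured 18000 seconds
    grep -r "Retry interval" crawldump | grep -v "18000"

Two notes on reading the dump: the "(0 days)" next to "18000 seconds" is just integer truncation (18000 seconds is about 0.2 days); the value that is honored is the seconds figure. And generate only selects an entry once its fetch time has passed, so with a 5 hour interval a page fetched at 10:53 should not become eligible again before 15:53 - repeated fetches within one recrawl therefore point at entries whose interval was reset to 0.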
BELLINI ADAM wrote:
> hi,
> I dumped the database, and this is what I found:
>
> Status: 1 (db_unfetched)
> Fetch time: Thu Dec 03 15:53:24 EST 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 18000 seconds (0 days)
> Score: 2.0549393
> Signature: null
> Metadata:
>
> So if this URL is encountered several times within 2 hours, does the
> "(0 days)" mean it is going to be fetched several times? Will it not
> look at the 18000 seconds?
>
> Thanks
>
>> Date: Thu, 3 Dec 2009 22:39:29 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: db.fetch.interval.default
>>
>> hi,
>>
>> I have identified one source of such a problem and opened an issue in
>> Jira. You can apply this patch and check whether it solves your problem:
>>
>> https://issues.apache.org/jira/browse/NUTCH-774
>>
>> By the way, you can also check your crawldb for such items - the retry
>> interval is set to 0. Just dump the crawldb and search for it.
>>
>> regards
>> reinhard
>>
>> BELLINI ADAM wrote:
>>
>>> hi,
>>>
>>> I'm crawling my intranet, and I have set db.fetch.interval.default to
>>> 5 hours, but it seems it does not work correctly:
>>>
>>> <property>
>>>   <name>db.fetch.interval.default</name>
>>>   <value>18000</value>
>>>   <description>The number of seconds between re-fetches of a page
>>>   (5 hours).</description>
>>> </property>
>>>
>>> The first crawl, when the crawl directory $crawl does not exist yet,
>>> takes just 2 hours (with depth=10).
>>>
>>> But when performing the recrawl with the recrawl.sh script (with the
>>> crawldb full), it takes about 2 hours for each depth!
>>>
>>> And when I checked the log file, I found that one URL was fetched
>>> several times! So is my 5-hour db.fetch.interval.default working
>>> correctly?
>>>
>>> Why is it refetching the same URLs several times at each depth
>>> (depth=10)? I understood that the fetch time of a page would prevent a
>>> refetch as long as the 5 hours have not elapsed.
>>>
>>> Can you please explain how db.fetch.interval.default works? Should I
>>> use only one depth, since I already have all the URLs in the crawldb?
>>>
>>> I'm using this recrawl script:
>>>
>>> depth=10
>>>
>>> echo "----- Inject (Step 1 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>>>
>>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>>> for ((i=0; i < $depth; i++))
>>> do
>>>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>>
>>>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
>>>   if [ $? -ne 0 ]
>>>   then
>>>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>>     break
>>>   fi
>>>
>>>   segment=`ls -d $crawl/segments/* | tail -1`
>>>
>>>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>>>   if [ $? -ne 0 ]
>>>   then
>>>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>>>     echo "runbot: Deleting segment $segment."
>>>     rm $RMARGS $segment
>>>     continue
>>>   fi
>>>
>>>   echo "----- Updating Database ($steps) -----"
>>>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>>> done
>>>
>>> echo "----- Merge Segments (Step 3 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
>>>
>>> rm -r $crawl/segments
>>> mv $crawl/MERGEDsegments $crawl/segments
>>>
>>> echo "----- Invert Links (Step 4 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
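As a follow-up to the patch suggestion above: once NUTCH-774 is applied and updatedb has run again, a quick sketch for confirming the crawldb is clean (the dump directory name is again illustrative; the grep pattern follows the dump format shown earlier in this thread):

    # re-dump the crawldb and look for entries stuck at a zero interval
    bin/nutch readdb $crawl/crawldb -dump crawldump.after
    grep -r "Retry interval: 0 seconds" crawldump.after || echo "no zero-interval entries found"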
