hi,

i have identified one source of such a problem and opened an issue in jira.
you can apply this patch and check whether it solves your problem.

https://issues.apache.org/jira/browse/NUTCH-774

btw, you can also check your crawldb for such items - their retry interval
is set to 0. just dump the crawldb and search for them.
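
for example (a minimal sketch, assuming a nutch 1.x installation with the
crawldb under $crawl/crawldb; the dump directory name is arbitrary and the
exact "Retry interval" wording in the dump may differ between versions):

  # dump the crawldb to plain text files, then look for entries
  # whose retry interval was written as 0
  $NUTCH_HOME/bin/nutch readdb $crawl/crawldb -dump crawldb_dump
  grep -B 2 "Retry interval: 0" crawldb_dump/part-*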

regards
reinhard

BELLINI ADAM wrote:
> hi,
>
> i'm crawling my intranet, and i have set db.fetch.interval.default to
> 5 hours, but it seems that it doesn't work correctly:
>
> <property>
>   <name>db.fetch.interval.default</name>
>   <value>18000</value>
>   <description>The number of seconds between re-fetches of a page (5 hours).
>   </description>
> </property>
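>
> (for reference, the interval actually stored for a single page can be
> checked with the readdb tool - a minimal sketch, where
> http://intranet.example/page stands in as a placeholder url:
>
>   $NUTCH_HOME/bin/nutch readdb $crawl/crawldb -url http://intranet.example/page
>
> this prints the stored fetch time and retry interval for that url.)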
>
>
> the first crawl, when the crawl directory $crawl doesn't exist yet, takes
> just 2 hours (with depth=10).
>
>
> but when performing the recrawl with the recrawl.sh script (with a full
> crawldb), it takes about 2 hours for each depth!
>
> and when i checked the log file i found that the same URL is fetched
> several times! so is my 5-hour db.fetch.interval.default working correctly?
>
> why is it refetching the same URLs several times at each depth (depth=10)?
> i understood that a page's timestamp would prevent a refetch as long as
> the interval (5 hours) has not elapsed yet!
>
> please, can you explain how db.fetch.interval.default works? should i use
> only one depth, since i already have all the urls in the crawldb?
>
>
>
>
> i'm using this recrawl script:
>
> depth=10
>
> echo "----- Inject (Step 1 of $steps) -----"
> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>
> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
> for ((i=0; i < $depth; i++))
> do
>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>
>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
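>
>   # note: generate only selects urls whose fetch time has already expired
>   # (last fetch time + fetch interval); pages that come back at every
>   # depth usually have a broken (0) interval in the crawldb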
>
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>     break
>   fi
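>
>   # pick up the segment that generate just created (segments are named
>   # by timestamp, so the last one in sort order is the newest)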
>   segment=`ls -d $crawl/segments/* | tail -1`
>
>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>
>   if [ $? -ne 0 ]
>   then
>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>     echo "runbot: Deleting segment $segment."
>     rm $RMARGS $segment
>     continue
>   fi
>
> echo " ----- Updating Dadatabase ( $steps) -----"
>
>
>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>
> done
>
> echo "----- Merge Segments (Step 3 of $steps) -----"
> $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
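>
> # mergesegs combines the per-depth segments into a single segment,
> # which then replaces the old segments directory below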
>
>
> rm $RMARGS $crawl/segments
> mv $crawl/MERGEDsegments $crawl/segments
>
> echo "----- Invert Links (Step 4 of $steps) -----"
> $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*