Hi,
I'm crawling my intranet, and I have set db.fetch.interval.default to
5 hours, but it doesn't seem to work correctly:
<property>
<name>db.fetch.interval.default</name>
<value>18000</value>
<description>The number of seconds between re-fetches of a page (5 hours).
</description>
</property>
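For reference, 5 hours expressed in seconds does match the value above:

```shell
# 5 hours * 60 minutes * 60 seconds = the <value> in the property
echo $((5 * 60 * 60))   # prints 18000
```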
On the first crawl, when the crawl directory $crawl doesn't exist yet, the crawl
takes just 2 hours (with depth=10).
But when performing the recrawl with the recrawl.sh script (with the crawldb full),
it takes about 2 hours for each depth!
When I checked the log file, I found that a single URL is fetched several
times. So is my 5-hour db.fetch.interval.default working correctly?
Why is it refetching the same URLs several times at each depth (depth=10)? My
understanding was that the timestamp on each page should prevent a refetch until
the interval (5 hours) has elapsed.
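This is roughly how I counted the repeated fetches; the log location and the
"fetching <url>" message format are assumptions from my setup (here I use a
small sample file in place of the real log):

```shell
# Count how often each URL appears as a "fetching" line in the log.
# /tmp/hadoop.log.sample stands in for logs/hadoop.log (an assumption;
# adjust the path and message pattern to your Nutch version).
cat > /tmp/hadoop.log.sample <<'EOF'
fetching http://intranet/page1
fetching http://intranet/page2
fetching http://intranet/page1
EOF
grep '^fetching ' /tmp/hadoop.log.sample | sort | uniq -c | sort -rn
```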
Can you please explain how db.fetch.interval.default works? Should I
use only one depth, since I have all the URLs in the crawldb?
I'm using this recrawl script:
depth=10

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i = 0; i < depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi
  segment=`ls -d $crawl/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi
  echo "----- Updating Database (Step 2 of $steps) -----"
  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
rm -rf $crawl/segments
mv $crawl/MERGEDsegments $crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
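The script above reads several variables it never sets; this is a sketch of the
environment I give it before running (the values are placeholders, only the
variable names come from the script itself):

```shell
# Example environment for the recrawl script; all values are placeholders.
export NUTCH_HOME=/opt/nutch   # Nutch install directory
export crawl=crawl             # crawl directory
export threads=10              # fetcher threads
export topN=1000               # max URLs per generated segment
export steps=4                 # used only in the progress messages
export RMARGS=-rf              # rm arguments when deleting a failed segment
```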