hi,

i'm crawling my intranet, and i have set db.fetch.interval.default to
5 hours, but it seems it isn't working correctly:

<property>
  <name>db.fetch.interval.default</name>
  <value>18000</value>
  <description>The number of seconds between re-fetches of a page (5 hours).
  </description>
</property>
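just to sanity-check the value itself:

```shell
# 18000 seconds divided by 3600 seconds per hour should give 5 hours
echo "$((18000 / 3600)) hours"
```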


on the first crawl, when the crawl directory $crawl doesn't exist yet, the crawl
takes just 2 hours (with depth=10).


but when performing the recrawl with the recrawl.sh script (with the crawldb full),
it takes about 2 hours for each depth!

and when i checked the log file i found that the same URL is fetched several
times! so is my 5-hour db.fetch.interval.default working correctly?

why is it refetching the same URLs several times at each depth (depth=10)? i
understood that the timestamp of a page would prevent a refetch as long as the
interval (5 hours) has not elapsed yet.

please, can you explain to me how db.fetch.interval.default works? should i
use only one depth, since i already have all the urls in the crawldb?
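here is my understanding of the eligibility check, sketched in plain shell (this is only an illustration of the logic i expect, not Nutch's actual code):

```shell
# Illustration only: a page should be selected by generate when
#   now >= last_fetch + interval
interval=18000                         # db.fetch.interval.default, in seconds
last_fetch=$(( $(date +%s) - 7200 ))   # pretend the page was fetched 2 hours ago
now=$(date +%s)

if [ $now -ge $(( last_fetch + interval )) ]; then
  echo "due for refetch"
else
  echo "not due yet"
fi
```

with these numbers the page was fetched only 2 hours ago, so i would expect "not due yet" — which is why the repeated fetches surprise me.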




i'm using this recrawl script:

depth=10

echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $crawl/crawldb urls

echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for ((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"

  $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN

  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth `expr $i + 1`. No more URLs to fetch."
    break
  fi

  segment=`ls -d $crawl/segments/* | tail -1`

  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads

  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi

  echo "----- Updating Database -----"

  $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment

done

echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*


rm -rf $crawl/segments
mv $crawl/MERGEDsegments $crawl/segments

echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
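(for completeness: the script assumes $NUTCH_HOME, $crawl, $threads, $topN, $steps and $RMARGS are set earlier in the file; the values below are just my own example setup, not part of the script itself)

```shell
# Hypothetical preamble the script relies on -- adjust paths/values as needed
NUTCH_HOME=/opt/nutch      # Nutch installation directory
crawl=crawl                # crawl directory
threads=10                 # fetcher threads
topN=1000                  # max URLs per generated segment
steps=4                    # number of steps, used only in the echo messages
RMARGS="-rf"               # arguments passed to rm when deleting a failed segment
```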



                                          