The crawl datum here has the state db_unfetched.
It has not been fetched yet.
Are you sure that you don't have crawl datums with a retry interval of 0 seconds?

grep Retry crawldump | grep -v "18000"
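
(Assuming "crawldump" is a text dump of your crawldb produced with the readdb
tool, e.g.:

bin/nutch readdb $crawl/crawldb -dump crawldump

Note that readdb writes part files into the output directory, so you may need
to grep recursively, e.g. grep -r Retry crawldump)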

BELLINI ADAM wrote:
> hi
> I dumped the database, and this is what I found:
>
>
> Status: 1 (db_unfetched)
> Fetch time: Thu Dec 03 15:53:24 EST 2009
> Modified time: Wed Dec 31 19:00:00 EST 1969
> Retries since fetch: 0
> Retry interval: 18000 seconds (0 days)
> Score: 2.0549393
> Signature: null
> Metadata:
>
> So if this URL is encountered several times within 2 hours, does that mean
> that because of the "0 days" it is going to be fetched several times?
> Will it not look at the 18000 seconds?
>
>
> thx
>
>> Date: Thu, 3 Dec 2009 22:39:29 +0100
>> From: [email protected]
>> To: [email protected]
>> Subject: Re: db.fetch.interval.default
>>
>> hi,
>>
>> I have identified one source of such a problem and opened an issue in Jira.
>> You can apply this patch and check whether it solves your problem.
>>
>> https://issues.apache.org/jira/browse/NUTCH-774
>>
>> By the way, you can also check your crawldb for such items - entries whose
>> retry interval is set to 0. Just dump the crawldb and search for them.
>>
>> regards
>> reinhard
>>
>> BELLINI ADAM wrote:
>>     
>>> hi,
>>>
>>> I'm crawling my intranet, and I have set db.fetch.interval.default to
>>> 5 hours, but it seems that it doesn't work correctly:
>>>
>>> <property>
>>>   <name>db.fetch.interval.default</name>
>>>   <value>18000</value>
>>>   <description>The number of seconds between re-fetches of a page
>>>   (5 hours).</description>
>>> </property>
>>>
>>>
>>> The first crawl, when the crawl directory $crawl does not exist yet,
>>> takes just 2 hours (with depth=10).
>>>
>>>
>>> But when performing the recrawl with the recrawl.sh script (with the
>>> crawldb full), it takes about 2 hours for each depth!
>>>
>>> And when I checked the log file, I found that one URL is fetched several
>>> times! So is my 5-hour db.fetch.interval.default working correctly?
>>>
>>> Why is it refetching the same URLs several times at each depth
>>> (depth=10)? I understood that the pages' timestamps would not allow a
>>> refetch since the interval (5 hours) has not elapsed yet.
>>>
>>> Please can you explain to me how db.fetch.interval.default works? Should
>>> I use only one depth, since I have all the URLs in the crawldb?
>>>
>>>
>>>
>>>
>>> I'm using this recrawl script:
>>>
>>> depth=10
>>>
>>> echo "----- Inject (Step 1 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch inject $crawl/crawldb urls
>>>
>>> echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
>>> for((i=0; i < $depth; i++))
>>>
>>> do
>>>
>>>   echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
>>>
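>>>   # generate only selects URLs whose next fetch time has passed, so a
>>>   # retry interval of 0 makes the same URL eligible in every round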
>>>   $NUTCH_HOME/bin/nutch generate $crawl/crawldb $crawl/segments $topN
>>>
>>>   if [ $? -ne 0 ]
>>>   then
>>>     echo "runbot: Stopping at depth $depth. No more URLs to fetch."
>>>     break
>>>   fi
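>>>   # pick the newest segment, i.e. the one generate just created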
>>>   segment=`ls -d $crawl/segments/* | tail -1`
>>>
>>>   $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
>>>
>>>   if [ $? -ne 0 ]
>>>   then
>>>     echo "runbot: fetch $segment at depth `expr $i + 1` failed."
>>>     echo "runbot: Deleting segment $segment."
>>>     rm $RMARGS $segment
>>>     continue
>>>   fi
>>>
>>> echo " ----- Updating Dadatabase ( $steps) -----"
>>>
>>>
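>>>   # updatedb folds fetch results back into the crawldb; a fetched URL's
>>>   # next fetch time becomes roughly now + its retry interval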
>>>   $NUTCH_HOME/bin/nutch updatedb $crawl/crawldb $segment
>>>
>>> done
>>>
>>> echo "----- Merge Segments (Step 3 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch mergesegs $crawl/MERGEDsegments $crawl/segments/*
>>>
>>>
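>>> # replace the old per-round segments with the single merged segment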
>>> rm -rf $crawl/segments
>>> mv $crawl/MERGEDsegments $crawl/segments
>>>
>>> echo "----- Invert Links (Step 4 of $steps) -----"
>>> $NUTCH_HOME/bin/nutch invertlinks $crawl/linkdb $crawl/segments/*
>>>
