Take a look at the crawl-delay setting in the robots.txt file on the 
website you are attempting to fetch.  It may be what is slowing you down.
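
For reference, a throttled site would carry an entry like the following
in its robots.txt (Crawl-delay is a non-standard but widely honored
directive; the value is in seconds, and 20 here is just an example):

    User-agent: *
    Crawl-delay: 20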

There is a setting, fetcher.max.crawl.delay, in conf/nutch-default.xml 
(override it in conf/nutch-site.xml) that can change the behavior for 
this.  The default is 30 seconds, meaning nutch will skip pages whose 
Crawl-delay is over 30 seconds.  In robots.txt the delay is given in 
seconds; nutch converts it to milliseconds internally, so a delay of 20 
seconds becomes 20000.  If that website has a Crawl-delay of, say, 20, 
nutch would wait 20 seconds between each webpage request.  If this is 
the case and the site has say 10,000 pages, then fetching would take 
10,000 x 20 seconds = 200,000 seconds, or around 2.3 days.
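
If you want nutch to tolerate longer delays (or to skip slow sites 
sooner), set the property in conf/nutch-site.xml.  A minimal sketch 
using the standard property name; the value of 45 is only an example:

    <property>
      <name>fetcher.max.crawl.delay</name>
      <value>45</value>
      <description>Skip pages whose robots.txt Crawl-delay exceeds
      this many seconds.</description>
    </property>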

Dennis Kubes

cesar voulgaris wrote:
> OK, thanks
> 
> On 2/13/07, cesar voulgaris <[EMAIL PROTECTED]> wrote:
>>
>> hi, maybe someone who has the same problem can help me:
>>
>> I started a crawl. At a certain depth the fetcher logs the urls
>> apparently correctly, but for two days!! it seems to be fetching the
>> same site (a big one, but not that big). What disturbs me is that the
>> segment directory is always the same size (du -hs segmentdir); it only
>> has crawl_generate as a subdir. Does nutch have a temporary dir where
>> it stores the fetches until it writes the other subdirs? ...maybe it
>> is hung up? It happened two times in different crawls (I did several
>> crawls; it's not too common).
>>
> 
