Hello Alexander,

Thanks for the clarification. I would like to clarify a few points:
- How do we find idle threads after executing a Nutch crawl?
- Is there anywhere in the configuration where I can set a delay based on the
URL? (see the config sketch after this list)
- Do you mean (from your last message) that complex documents such as PDF/DOC
take more time to parse?
- If I use PDFBox/AbiWord to extract content, do they also extract links? Are
they helpful?
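
For reference, this is a minimal sketch of what I think the relevant settings
in conf/nutch-site.xml would be (the property names are the ones I see in
conf/nutch-default.xml on my copy; please correct me if they are wrong or
differ in your version):

  <!-- politeness and concurrency settings I believe affect fetch speed -->
  <property>
    <name>fetcher.server.delay</name>
    <!-- seconds to wait between two requests to the same host -->
    <value>5.0</value>
  </property>
  <property>
    <name>fetcher.threads.fetch</name>
    <!-- total number of fetcher threads -->
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <!-- concurrent requests allowed against a single host -->
    <value>1</value>
  </property>

As far as I can tell the delay is applied per host rather than per URL, so a
per-URL delay would probably need a custom change. I invoke the crawl roughly
as:

  bin/nutch crawl urls -dir crawl -depth 3 -topN 1000 -threads 10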

If you have any input, please let me know. Also, is there a good wiki that
could help me implement Nutch robustly and optimise it?

Thanks.
Saks

Alexander Aristov wrote:
> 
> Fetching speed depends on many factors: the number of threads, the number of
> URLs from a single site (by default, pages from one site are downloaded one
> by one), and the delays inserted between downloads so as not to bombard the
> target site.
> 
> And remember that you do not just download the pages, you also parse them,
> which can be a very expensive operation, for example for PDF docs.
> 
> 
> Alexander
> 
> 2008/11/26 discoversk <[EMAIL PROTECTED]>
> 
>>
>> Hi,
>>
>>   I have implemented Nutch, but I am not able to download more than
>> 30-50 KB/sec of data, even though we have enough bandwidth (1200 KB/sec free
>> most of the time).
>>
>> I think I might have missed something during implementation. Can anyone
>> help by pointing out the things that could improve the download speed?
>>
>> I have tried to achieve this by increasing the number of threads, depth,
>> topN, etc., but in my case I can't see much difference.
>>
>>
>> Thanks in advance.
>> Saks
>>
>>
> 
> 
> -- 
> Best Regards
> Alexander Aristov
> 
> 
