Check your nutch-site.xml or nutch-default.xml file.

There are these settings:
fetcher.server.delay
fetcher.server.min.delay
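
As a rough sketch, overriding them in conf/nutch-site.xml could look like the
fragment below. The values are illustrative assumptions only, not
recommendations; check nutch-default.xml in your version for the actual
defaults and descriptions.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Seconds to wait between successive requests to the same server.
       Illustrative value; a lower delay puts more load on the target site. -->
  <property>
    <name>fetcher.server.delay</name>
    <value>1.0</value>
  </property>

  <!-- Minimum delay in seconds between requests to the same server.
       Illustrative value. As far as I know this only takes effect when more
       than one fetcher thread per host is allowed. -->
  <property>
    <name>fetcher.server.min.delay</name>
    <value>0.0</value>
  </property>
</configuration>
```

Note that these delays apply per server, not per individual URL.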

Why do you want to find idle threads?

Yes, PDF documents are by nature difficult to parse, so they take more
time than simple HTML pages. For example, a PDF file of 5-6 MB can take up
to a few seconds to parse.


The default Nutch PDF parser doesn't extract links from documents, but you
can rewrite it to do so. The PDFBox library allows this.

Alexander

2008/11/26 discoversk <[EMAIL PROTECTED]>

>
> Hello Alexander,
>
> Thanks for the clarification. I would like to clarify:
> - how do we find idle threads after executing a Nutch crawl?
> - is there any place in the configuration where I could set a delay based
> on URL?
> >> you mean to say (in your last message) that complex documents take more
> >> time while parsing (pdf/doc)
> - if I use PDFBox/AbiWord to extract, do they extract links? Are they
> helpful?
>
> If you have any input please let me know, or is there a good wiki that can
> help me implement Nutch robustly and in an optimised way?
>
> Thanks.
> Saks
>
>
>
>
>
> Alexander Aristov wrote:
> >
> > Fetching speed depends on many factors: the number of threads and the
> > number of URLs from a single site (by default, pages from one site are
> > downloaded one by one). There are also delays inserted between downloads
> > so as not to bombard the target site.
> >
> > And remember that you don't just download the pages, you also parse them,
> > which can be a very expensive operation, for example for PDF docs.
> >
> >
> > Alexander
> >
> > 2008/11/26 discoversk <[EMAIL PROTECTED]>
> >
> >>
> >> Hi,
> >>
> >>   I have implemented Nutch, but I am not able to download more than
> >> 30-50kb/sec of data, even though we have enough bandwidth (1200k/sec free
> >> most of the time)...
> >>
> >> I think I might have missed something during implementation. Can anyone
> >> help or point out things that could improve download speed?
> >>
> >> I have tried to achieve this by increasing the number of threads, depth
> >> and top etc., but in my case I can't see much difference...
> >>
> >>
> >> Thanks in advance.
> >> Saks
> >> --
> >> View this message in context:
> >>
> http://www.nabble.com/Implementing-nutch-to-get-maximum-download-rate-tp20697389p20697389.html
> >> Sent from the Nutch - User mailing list archive at Nabble.com.
> >>
> >>
> >
> >
> > --
> > Best Regards
> > Alexander Aristov
> >
> >
>
> --
> View this message in context:
> http://www.nabble.com/Implementing-nutch-to-get-maximum-download-rate-tp20697389p20699496.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>
>


-- 
Best Regards
Alexander Aristov
