Uroš Gruber wrote:
Andrzej Bialecki wrote:
Sami Siren (JIRA) wrote:
I am not sure what you are referring to by this 3-4 sec, but yes, I agree
there are more aspects of the fetcher to optimize. What I was primarily
concerned about was the fetching I/O speed, which was getting ridiculously
low (not quite sure when this happened).
I set DEBUG-level logging and checked the timing of operations: the
MapReduce job that runs after every page takes 3-4 seconds before the
next URL is fetched.
I have a local site, and fetching 100 pages takes about 6 minutes.
Even I haven't seen it go that slow :)
Depending on the number of map/reduce tasks, there is a framework
overhead to transfer the job JAR file and start the subprocess on
each tasktracker. However, once these are started, the framework's
overhead should be negligible, because a single task is responsible for
fetching many urls.
Naturally, for small jobs, with very few urls, the overhead is
relatively large.
The symptom I'm seeing is that eventually most threads end up
spin-waiting in blockAddr. Another problem I see is that when the number
of fetching threads is high relative to the available bandwidth, the
data is trickling in so slowly that Fetcher.run() decides that
it's hung and aborts the task. What happens then is that the task
gets a SUCCEEDED status in the tasktracker, although in reality it may
have fetched only a small portion of the allotted fetchlist.
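The abort behavior described above follows a common watchdog pattern: record the time of the last successful fetch and give up when no progress is seen within a timeout. A minimal sketch of that pattern (this is an illustration, not the actual Nutch Fetcher code; the class and timeout parameter here are hypothetical) shows why a slow-but-live connection can be mistaken for a hang:

```java
// Illustrative sketch of the hang-detection pattern described above.
// Fetcher threads call recordProgress() after each completed page; a
// watchdog calls isHung() periodically and aborts the task when true.
// On a saturated link, pages trickle in so slowly that the timeout can
// expire even though the threads are still making (slow) progress.
public class HangCheck {
    private final long timeoutMs;        // hypothetical timeout setting
    private volatile long lastActivityMs;

    public HangCheck(long timeoutMs, long nowMs) {
        this.timeoutMs = timeoutMs;
        this.lastActivityMs = nowMs;
    }

    /** Called by a fetcher thread after each completed page. */
    public void recordProgress(long nowMs) {
        lastActivityMs = nowMs;
    }

    /** Called by the watchdog; true means "assume hung, abort the task". */
    public boolean isHung(long nowMs) {
        return nowMs - lastActivityMs > timeoutMs;
    }

    public static void main(String[] args) {
        HangCheck check = new HangCheck(5000, 0);
        check.recordProgress(1000);
        System.out.println(check.isHung(3000)); // false: recent progress
        System.out.println(check.isHung(7000)); // true: 6s > 5s timeout
    }
}
```

Note that with this pattern the watchdog cannot distinguish "hung" from "slow"; a check based on bytes per second over a window would misfire less on low-bandwidth links.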
I would like to help find what causes such slowness. Version 0.7 did
not use MapReduce, and it fetched about 20 pages per second on
the same server. On the same site, fetching is now reduced to 0.3 pages
per second.
With the queue-based solution I just did a crawl of about 600k pages, and it
averaged 16 pps (1376 kB/s) with parsing enabled. Perhaps you could try
Andrzej's new Fetcher and see how it performs for you (I haven't yet read
the code or tested it myself).
--
Sami Siren