Re: (NUTCH-339) Refactor nutch to allow fetcher improvements

Andrzej Bialecki Fri, 04 Aug 2006 09:46:50 -0700

Sami Siren (JIRA) wrote:

I am not sure what you are referring to by this 3-4 sec, but yes, I agree
there are more aspects to optimize in the fetcher. What I was first concerned
about was the fetching IO speed, which was getting ridiculously low (not quite
sure when this happened).

Depending on the number of map/reduce tasks, there is a framework overhead to transfer the job JAR file, and start the subprocess on each tasktracker. However, once these are started the framework's overhead should be negligible, because a single task is responsible for fetching many urls.

Naturally, for small jobs, with very few urls, the overhead is relatively large.
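
To put rough numbers on this, here is a minimal sketch of how a fixed per-task startup cost is amortized over the urls in a fetchlist. The class name, the 10-second startup cost and the 0.5 seconds per url are made-up assumptions for illustration only, not measurements from Nutch or Hadoop.

```java
// Hypothetical illustration; none of these names or numbers come from Nutch.
final class OverheadSketch {

    /**
     * Fraction of total task time spent on fixed startup overhead
     * (JAR transfer, subprocess start), assuming a constant per-url fetch time.
     */
    static double overheadFraction(double startupSeconds, double secondsPerUrl, long urls) {
        double fetchSeconds = secondsPerUrl * urls;
        return startupSeconds / (startupSeconds + fetchSeconds);
    }

    public static void main(String[] args) {
        // Assumed values: 10 s startup, 0.5 s per url.
        for (long urls : new long[] {10, 1_000, 100_000}) {
            double percent = 100 * overheadFraction(10.0, 0.5, urls);
            System.out.printf("%d urls: %.2f%% overhead%n", urls, percent);
        }
    }
}
```

Under these assumptions the overhead share shrinks quickly as the fetchlist grows, which is why it only matters for very small jobs.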

The symptom I'm seeing is that eventually most threads end up in blockAddr spin-waiting. Another problem I see is that when the number of fetching threads is high relative to the available bandwidth, the data is trickling in so slowly that the Fetcher.run() decides that it's hung, and aborts the task. What happens then is that the task gets a SUCCEEDED status in tasktracker, although in reality it may have fetched only a small portion of the allotted fetchlist.
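
As a rough sketch of one way the second problem could be approached (this is not the current Fetcher code, and all names below are hypothetical): treat any received bytes as progress, so that slow trickling is not mistaken for a hang, and derive the exit status from how much of the fetchlist was actually completed. Timestamps are passed in explicitly only to keep the example easy to test.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical progress tracker; not Nutch's actual Fetcher implementation.
final class FetchProgressSketch {

    enum Outcome { SUCCEEDED, FAILED_INCOMPLETE }

    private final long totalUrls;
    private final long timeoutMillis;
    private final AtomicLong lastProgressMillis;
    private final AtomicLong pagesDone = new AtomicLong();

    FetchProgressSketch(long totalUrls, long timeoutMillis, long nowMillis) {
        this.totalUrls = totalUrls;
        this.timeoutMillis = timeoutMillis;
        this.lastProgressMillis = new AtomicLong(nowMillis);
    }

    /** Called by a fetcher thread whenever any bytes arrive, even mid-page. */
    void recordBytes(long nowMillis) {
        lastProgressMillis.set(nowMillis);
    }

    /** Called by a fetcher thread when a page has been fully fetched. */
    void recordPageDone(long nowMillis) {
        pagesDone.incrementAndGet();
        lastProgressMillis.set(nowMillis);
    }

    /** True only if no thread has reported any progress within the timeout. */
    boolean looksHung(long nowMillis) {
        return nowMillis - lastProgressMillis.get() > timeoutMillis;
    }

    /** Status to report when the task stops, whether it finished or was aborted. */
    Outcome outcomeOnExit() {
        return pagesDone.get() >= totalUrls ? Outcome.SUCCEEDED : Outcome.FAILED_INCOMPLETE;
    }
}
```

With something along these lines, an aborted task that fetched only part of its fetchlist would report FAILED_INCOMPLETE rather than a success status. How that should map onto the tasktracker's task states is left open here.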

We should open more than one ticket to track these separate aspects. And for 
general discussion the mailing lists are perhaps the best place.

(I'm moving this to the list then).


--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
