Sami Siren (JIRA) wrote:
I am not sure what you are referring to by this 3-4 sec, but yes, I agree there are more aspects to optimize in the fetcher. What I was first concerned about was the fetching IO speed, which was getting ridiculously low (not quite sure when this happened).
Depending on the number of map/reduce tasks, there is a framework overhead to transfer the job JAR file and start the subprocess on each tasktracker. However, once these are started, the framework's overhead should be negligible, because a single task is responsible for fetching many URLs.
Naturally, for small jobs with very few URLs, the overhead is relatively large.
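To put rough numbers on the amortization argument above, here is a small sketch. The class name, the method, and all the figures (10 seconds of startup cost, 1 second per URL) are made up for illustration; they are not measurements from Nutch or Hadoop.

```java
public class OverheadEstimate {
    /**
     * Fraction of a task's wall-clock time spent on fixed startup cost
     * (JAR transfer, subprocess launch) rather than on fetching.
     * All parameters are hypothetical inputs, not measured values.
     */
    public static double overheadFraction(double startupSeconds, int urls, double secondsPerUrl) {
        double fetchSeconds = urls * secondsPerUrl;
        return startupSeconds / (startupSeconds + fetchSeconds);
    }

    public static void main(String[] args) {
        // Same assumed startup cost, very different fetchlist sizes.
        System.out.println("10 urls:    " + overheadFraction(10.0, 10, 1.0));
        System.out.println("10000 urls: " + overheadFraction(10.0, 10000, 1.0));
    }
}
```

With the same fixed startup cost, the tiny fetchlist spends half its time on overhead, while for the large one the overhead is a fraction of a percent.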
The symptom I'm seeing is that eventually most threads end up spin-waiting in blockAddr. Another problem I see is that when the number of fetching threads is high relative to the available bandwidth, the data trickles in so slowly that Fetcher.run() decides that it's hung and aborts the task. The task then gets a SUCCEEDED status in the tasktracker, although in reality it may have fetched only a small portion of the allotted fetchlist.
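One possible way to avoid the misleading SUCCEEDED status is to make the abort visible as an error instead of a normal return. The following is only a sketch of that idea, not the actual Fetcher code: StallCheck, isIncompleteStall, checkProgress and all their parameters are hypothetical names. It relies on the assumption that an exception propagating out of the task makes the framework record the task as failed rather than succeeded.

```java
import java.io.IOException;

public class StallCheck {
    /**
     * Returns true when no progress was seen for longer than timeoutMs
     * while part of the fetchlist is still unfetched.
     * (Hypothetical helper, not part of Nutch.)
     */
    public static boolean isIncompleteStall(long nowMs, long lastProgressMs, long timeoutMs,
                                            int fetched, int total) {
        boolean stalled = nowMs - lastProgressMs > timeoutMs;
        return stalled && fetched < total;
    }

    /**
     * Throws instead of returning quietly, so that the caller cannot
     * mistake an aborted fetch for a completed one.
     */
    public static void checkProgress(long nowMs, long lastProgressMs, long timeoutMs,
                                     int fetched, int total) throws IOException {
        if (isIncompleteStall(nowMs, lastProgressMs, timeoutMs, fetched, total)) {
            throw new IOException("Fetch stalled after " + fetched + " of " + total + " urls");
        }
    }

    public static void main(String[] args) {
        try {
            // Assumed values: 10 s since last progress, 5 s timeout, 3 of 100 urls done.
            checkProgress(10_000L, 0L, 5_000L, 3, 100);
            System.out.println("still making progress");
        } catch (IOException e) {
            System.out.println("aborting with error: " + e.getMessage());
        }
    }
}
```

Whether failing the whole task is preferable to keeping the partial output is a separate design question; the sketch only shows how to keep the two outcomes distinguishable.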
We should open more than one ticket to track these separate aspects. And for general discussion, the mailing lists are perhaps the best place.
(I'm moving this to the list then).

--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
