Nutch Developers,

At 9:00pm +0100 1/15/06, Andrzej Bialecki wrote:
>Also, I think the current implementation is not optimal, because it runs only 
>a single map task for a fetcher. The reason for this is that it was the 
>easiest way to ensure that we don't violate the politeness rules - if we ran 
>multiple map tasks the methods blockAddr/unblockAddr in protocol-http couldn't 
>prevent other map tasks from using the same address.
>
>The proper solution is IMHO a central lock manager. I looked at the code, it 
>seems to me that JobTracker could manage this central lock manager (one per 
>job? one per cluster? perhaps both?), this could be a part of a 
>JobSubmissionProtocol - but I think there is no way now for the arbitrary code 
>to reference its JobClient.. bummer.

As I understand it, the current MapReduce implementation of fetching is 
restricted to running only one map task at a time on each TaskTracker. This is 
because IP blocking can't span multiple JVM instances, so there's no way to 
prevent two child processes on the same TaskTracker from hitting the same 
server simply through a Nutch-0.7-style blocking mechanism.

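To illustrate why the blocking can't span JVMs, here's a rough sketch of the 0.7-style approach from memory (the real blockAddr/unblockAddr code in protocol-http differs in detail). The blocked-host table is a static field, so a second child JVM on the same machine simply can't see it:

    import java.util.HashMap;
    import java.util.Map;

    // Rough sketch only - not the actual protocol-http code.
    public class HostBlocker {
        // host -> earliest time (ms) at which it may be fetched again;
        // static, so this state is confined to a single JVM.
        private static final Map<String, Long> blocked =
            new HashMap<String, Long>();

        public static synchronized void blockAddr(String host, long delayMs) {
            blocked.put(host, System.currentTimeMillis() + delayMs);
        }

        public static synchronized boolean isBlocked(String host) {
            Long until = blocked.get(host);
            if (until == null) {
                return false;
            }
            if (System.currentTimeMillis() >= until) {
                blocked.remove(host); // delay has elapsed; unblock
                return false;
            }
            return true;
        }
    }
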
Assuming that the URLs have already been partitioned by domain (and setting 
aside the merits of partitioning this way vs. by IP address - see my other 
email), wouldn't it be possible for the TaskTracker to prevent two child 
processes from hitting the same domain by ensuring that each one works on a 
separate set of domains?

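In other words, if the fetch lists were partitioned by host, each map task 
would own a disjoint set of hosts, and no cross-task blocking would be needed 
at all. A toy version of the kind of partitioning I have in mind (again, just 
a sketch; hashing by IP instead of host would be the alternative from my other 
email):

    import java.net.MalformedURLException;
    import java.net.URL;

    // Toy sketch: map every URL for a given host to the same partition,
    // so that no two map tasks ever share a host.
    public class HostPartition {
        public static int partitionFor(String url, int numPartitions)
                throws MalformedURLException {
            String host = new URL(url).getHost().toLowerCase();
            // Mask the sign bit so the result is non-negative.
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }
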
Please forgive my vague understanding of the MapReduce implementation. I may 
also have misunderstood the gist of Andrzej's post (copied above).

Thanks,

- Chris

-- 
------------------------
Chris Schneider
TransPac Software, Inc.
[EMAIL PROTECTED]
------------------------
