On Jul 18, 2011, at 12:53 PM, Ben Clay wrote:

> I'd like to spread Hadoop across two physical clusters, one which is
> publicly accessible and the other which is behind a NAT. The NAT'd
> machines will only run TaskTrackers, not HDFS, and not Reducers either
> (configured with 0 Reduce slots). The master node will run in the
> publicly-available cluster.
Off the top, I doubt it will work: MR is bi-directional, across many random ports. So I would suspect there is going to be a lot of hackiness in the network config to make this work.

> 1. Port 50060 needs to be opened for all NAT'd machines, since Reduce
> tasks fetch intermediate data from http://<tasktracker>:50060/mapOutput,
> correct? I'm getting "Too many fetch-failures" with no open ports, so I
> assume the Reduce tasks need to pull the intermediate data instead of Map
> tasks pushing it.

Correct. Reduce tasks pull. (Rough firewall sketch at the end of this mail.)

> 2. Although the NAT'd machines have unique IPs and reach the outside, the
> DHCP is not assigning them hostnames. Therefore, when they join the
> JobTracker I get
> "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
> machine list page. Is there some way to force Hadoop to refer to them via
> IP instead of hostname, since I don't have control over the DHCP? I could
> manually assign a hostname via /etc/hosts on each NAT'd machine, but these
> are actually VMs and I will have many of them receiving semi-random IPs,
> making this an ugly administrative task.

Short answer: no.  Long answer: no, fix your DHCP and/or do the /etc/hosts hack (sketched below).
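
On the port question, a minimal sketch of what opening the fetch port might look like on each NAT'd TaskTracker, assuming you manage the host firewall with iptables and keep the default TaskTracker HTTP address (mapred.task.tracker.http.address, which defaults to 0.0.0.0:50060 if memory serves):

  # Allow inbound reduce-side fetches of map output on the TaskTracker
  # HTTP port. 50060 is just the default; match whatever
  # mapred.task.tracker.http.address is set to in mapred-site.xml.
  iptables -A INPUT -p tcp --dport 50060 -j ACCEPT

Opening the host firewall is only half of it, though: the NAT box would also have to forward that port to each VM, and with many VMs behind one external address there is no clean way for Hadoop to advertise a per-VM external port. Hence my skepticism above.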
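
As for the /etc/hosts hack, it is roughly this on each NAT'd VM (the name and address here are invented; substitute each VM's current lease):

  # /etc/hosts on one NAT'd VM: give it a real name so the TaskTracker
  # stops registering as localhost.localdomain/127.0.0.1.
  127.0.0.1    localhost
  10.0.0.17    worker17.example.com worker17

plus setting the machine's hostname to match (e.g. with the hostname command) before the TaskTracker starts. Since the leases are semi-random, you would want a small boot script that generates the entry from whatever address the VM picked up. Ugly, as you say, but at least automatable.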