On Jul 18, 2011, at 12:53 PM, Ben Clay wrote:

> I'd like to spread Hadoop across two physical clusters, one which is
> publicly accessible and the other which is behind a NAT. The NAT'd
> machines will only run TaskTrackers, not HDFS, and not Reducers either
> (configured with 0 Reduce slots). The master node will run in the
> publicly-available cluster.
Off the top, I doubt it will work: MR is bi-directional, across many random ports. So I would suspect there is going to be a lot of hackiness in the network config to make this work.

> 1. Port 50060 needs to be opened for all NAT'd machines, since Reduce
> tasks fetch intermediate data from http://<tasktracker>:50060/mapOutput,
> correct? I'm getting "Too many fetch-failures" with no open ports, so I
> assume the Reduce tasks need to pull the intermediate data instead of Map
> tasks pushing it.

Correct. Reduce tasks pull. (Rough firewall sketch at the end of this mail.)

> 2. Although the NAT'd machines have unique IPs and reach the outside, the
> DHCP is not assigning them hostnames. Therefore, when they join the
> JobTracker I get
> "tracker_localhost.localdomain:localhost.localdomain/127.0.0.1" on the
> machine list page. Is there some way to force Hadoop to refer to them via
> IP instead of hostname, since I don't have control over the DHCP? I could
> manually assign a hostname via /etc/hosts on each NAT'd machine, but these
> are actually VMs and I will have many of them receiving semi-random IPs,
> making this an ugly administrative task.

Short answer: no.  Long answer: no, fix your DHCP and/or do the /etc/hosts hack (sketched below).
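
On the port question, a minimal sketch of what opening the fetch port might look like on each NAT'd TaskTracker, assuming you manage the host firewall with iptables and keep the default TaskTracker HTTP address (mapred.task.tracker.http.address, which defaults to 0.0.0.0:50060 if memory serves):

  # Allow inbound reduce-side fetches of map output on the TaskTracker
  # HTTP port. 50060 is just the default; match whatever
  # mapred.task.tracker.http.address is set to in mapred-site.xml.
  iptables -A INPUT -p tcp --dport 50060 -j ACCEPT

Opening the host firewall is only half of it, though: the NAT box would also have to forward that port to each VM, and with many VMs behind one external address there is no clean way for Hadoop to advertise a per-VM external port. Hence my skepticism above.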
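
As for the /etc/hosts hack, it is roughly this on each NAT'd VM (the name and address here are invented; substitute each VM's current lease):

  # /etc/hosts on one NAT'd VM: give it a real name so the TaskTracker
  # stops registering as localhost.localdomain/127.0.0.1.
  127.0.0.1    localhost
  10.0.0.17    worker17.example.com worker17

plus setting the machine's hostname to match (e.g. with the hostname command) before the TaskTracker starts. Since the leases are semi-random, you would want a small boot script that generates the entry from whatever address the VM picked up. Ugly, as you say, but at least automatable.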