Paul,

I am thinking about the mapred branch and the case of a mapred
multiprocess run over one or more machines.  In this case,
multiple tasktracker processes are created.

I'm not sure what you mean.
As far as I understand the code, there is only one tasktracker per machine.


> Why are the taskReportPort and mapOutputPort randomly generated?
> I cannot see any reason for that and am wondering why we do not
> just make them configurable as well.

There is a reason to bind to a random port in some cases.  I once had
a process fire off every 5 minutes to make an SSH connection so that
Unison could run over it.  The static port I picked should, in theory,
have been available again within 5 minutes, but once every few days
the port would be stuck and new SSH connections would fail.  I did not
determine why it got stuck, but the OS waits about 3 minutes before
releasing an unclosed socket, just in case the closing ACK packet is
still bouncing around through all possible hops.

As you can see in the tasktracker code, the ports are cleanly closed even when the tasktracker status is wrong, since there is a finally section that closes the outputserver and the reportserver in any case.
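
A minimal Java sketch of that close-in-finally pattern, for illustration only. It assumes a plain java.net.ServerSocket rather than the tasktracker's real server classes, and the names FixedPortServer, open and serveOnce are made up here. The socket is closed in a finally block whether or not the work fails, so the same fixed port can be bound again by the next run. Setting SO_REUSEADDR before binding is an extra precaution of this sketch, not something taken from the tasktracker code.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.ServerSocket;

public class FixedPortServer {
    /** Binds a listening socket to the given port (0 lets the OS choose). */
    public static ServerSocket open(int port) throws IOException {
        ServerSocket server = new ServerSocket();   // created unbound
        try {
            // Must be set before bind(); lets a restart rebind a port that
            // still has old connections lingering in TIME_WAIT.
            server.setReuseAddress(true);
            server.bind(new InetSocketAddress(port));
            return server;
        } catch (IOException e) {
            server.close();                         // do not leak the socket
            throw e;
        }
    }

    /** Runs the work while the port is held and always releases the port. */
    public static int serveOnce(int port, Runnable work) throws IOException {
        ServerSocket server = open(port);
        try {
            work.run();
            return server.getLocalPort();
        } finally {
            server.close();                         // runs even if work throws
        }
    }
}
```

Calling serveOnce twice in a row with the same port number should succeed both times, including when the first call's work throws an exception.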


If tasktracker ports are picked randomly without retrying when a port is
already busy, then that is a problem.  If the ports are picked randomly
until open ports are found, then that is okay.  An even better solution
is to have the sequence of ports tested be different for each tasktracker
process, so that N tasktrackers on one machine don't all simultaneously
race to listen on port P and then P+1, etc., for N-1 consecutive races.
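
A minimal Java sketch of that retry idea, again assuming plain java.net.ServerSocket instead of the real tasktracker servers. PortProbe, bindInRange and the seed parameter are hypothetical names used only for this example. Each process starts probing at a seed-dependent offset inside a fixed range and wraps around, so processes with different seeds do not all begin at the same port P.

```java
import java.io.IOException;
import java.net.ServerSocket;
import java.util.Random;

public class PortProbe {
    /**
     * Tries every port in [basePort, basePort + range) exactly once,
     * starting at an offset derived from the seed, and returns the first
     * socket that binds successfully.
     */
    public static ServerSocket bindInRange(int basePort, int range, long seed)
            throws IOException {
        if (range <= 0) {
            throw new IllegalArgumentException("range must be positive");
        }
        int start = new Random(seed).nextInt(range); // per-process start offset
        IOException last = null;
        for (int i = 0; i < range; i++) {
            int port = basePort + (start + i) % range; // wrap around the range
            try {
                return new ServerSocket(port);
            } catch (IOException e) {
                last = e;                              // port busy, try the next
            }
        }
        throw new IOException("no free port in " + basePort + "-"
                + (basePort + range - 1), last);
    }
}
```

The seed could be anything that differs between tasktracker processes on one machine, for example an instance index. Because the range is fixed, it can still be opened as a block in a firewall.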


As mentioned, random ports aren't manageable in real life. I do not know any system administrator who would shut down his iptables firewall just because our software uses random ports. A firewall is required, since the people using map reduce run large services, and larger systems are more interesting to hackers. A VPN would slow down the communication between the nodes considerably; I personally tried that, and it is very expensive as well. So the simplest solution is to make the ports configurable and, in case 2 tasktrackers run on one machine, just assign them 2 different ports.
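
A minimal Java sketch of what a configurable port lookup could look like. It uses java.util.Properties as a stand-in for the real configuration class, and both the PortConfig class and the property keys shown below are hypothetical, not existing settings.

```java
import java.util.Properties;

public class PortConfig {
    /**
     * Reads a port number from the configuration, falling back to the
     * given default when the key is absent, and rejects invalid values.
     */
    public static int port(Properties conf, String key, int defaultPort) {
        String raw = conf.getProperty(key);
        int port = defaultPort;
        if (raw != null) {
            try {
                port = Integer.parseInt(raw.trim());
            } catch (NumberFormatException e) {
                throw new IllegalArgumentException(
                        key + " is not a number: " + raw, e);
            }
        }
        if (port < 1 || port > 65535) {
            throw new IllegalArgumentException(
                    key + " is outside 1-65535: " + port);
        }
        return port;
    }
}
```

With two tasktrackers on one machine, each would get its own configuration, for example tasktracker.report.port=50050 for the first and tasktracker.report.port=50051 for the second (made-up key and values), and the firewall would need rules for exactly those ports.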

Greetings.
Stefan

