On 18/10/11 10:48, Petru Dimulescu wrote:
> Hello,
>
> I wonder how you guys see the problem of automatic node discovery:
> having, for instance, a couple of Hadoops, with no configuration
> explicitly set whatsoever, simply discover each other and work
> together, like GridGain does: just fire up two instances of the
> product, on the same machine or on different machines in the same
> LAN, and they will use multicast or whatever to discover each other
You can use techniques like Bonjour to have Hadoop services register
themselves in DNS and be located that way, but things only need to
discover the NN and JT and report in.
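As a minimal sketch of that idea using the jmDNS library (the
"_hadoop-nn._tcp" service type, instance name, and TXT payload are
invented for illustration; nothing like this ships with Hadoop):

  import java.net.InetAddress;
  import javax.jmdns.JmDNS;
  import javax.jmdns.ServiceInfo;

  // Sketch only: advertise an NN's RPC endpoint over mDNS/DNS-SD.
  public class NamenodeAdvertiser {
    public static void main(String[] args) throws Exception {
      JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());
      // "_hadoop-nn._tcp.local." is a hypothetical service type.
      ServiceInfo info = ServiceInfo.create(
          "_hadoop-nn._tcp.local.", "namenode", 8020, "fs=hdfs");
      jmdns.registerService(info);
      // Datanodes would browse for the same type and report in.
      Thread.sleep(Long.MAX_VALUE);
    }
  }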
> and to be a part of a self-discovered topology.
Topology inference is an interesting problem. Something purely for
diagnostics could be useful.
> Of course, if you have special network requirements you should be
> able to specify undiscoverable nodes by IP or name, but often grids
> are installed on LANs, and it should really be simpler.
In a production system I'd have a private switch and isolate things
for bandwidth and security; this is why auto-configuration is
generally neglected. If it were to be added, it would go via
ZooKeeper, leaving only the ZooKeeper discovery problem. You can't
rely on DNS or multicast IP here, as neither always works in
virtualised environments.
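Roughly, registration via ZooKeeper would look something like this
sketch (paths, hostnames, and ports are invented; assumes the parent
znodes already exist):

  import org.apache.zookeeper.CreateMode;
  import org.apache.zookeeper.ZooDefs;
  import org.apache.zookeeper.ZooKeeper;

  // Sketch only: a datanode registering itself in a ZK registry.
  public class DatanodeRegistration {
    public static void main(String[] args) throws Exception {
      // The ensemble address is the one thing that still has to be
      // configured or discovered some other way.
      ZooKeeper zk = new ZooKeeper(
          "zk1:2181,zk2:2181,zk3:2181", 30000, event -> { });
      // Ephemeral znodes vanish when the session dies, so the
      // registry doubles as a liveness tracker.
      zk.create("/hadoop/datanodes/dn-",
          "datanode1:50010".getBytes("UTF-8"),
          ZooDefs.Ids.OPEN_ACL_UNSAFE,
          CreateMode.EPHEMERAL_SEQUENTIAL);
      Thread.sleep(Long.MAX_VALUE); // keep the znode alive
    }
  }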
> Namenodes are a bit different; they should use safer machines. I'm
> basically talking about datanodes here, but still I wonder how hard
> it can be to have self-assigned namenodes, maybe replicated
> automatically on several machines, unless one specific namenode is
> explicitly set via XML configuration.
I wouldn't touch dynamic namenodes: you really need fixed NNs and
2NNs, and as automatic replication isn't there, it's a non-issue.
With fixed NN and JT entries in the DNS table, anything can come up
on the LAN and talk to them, unless you set up the master nodes with
lists of things you trust.
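With stock Hadoop that trust list is the dfs.hosts include file (and
mapred.hosts on the JT side); the path below is just an example:

  <!-- hdfs-site.xml: only datanodes named in the file may register -->
  <property>
    <name>dfs.hosts</name>
    <value>/etc/hadoop/conf/dfs.hosts.allow</value>
  </property>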
> Also, the ssh passwordless thing is so awkward. If you have a
> network of Hadoop nodes that mutually discover each other, there is
> really no need for this passwordless ssh requirement. This is more
> of a system administrator aspect: if sysadmins want to automatically
> deploy or start a program on 5000 machines, they often have the
> tools and skills to do that; it should not be a requirement.
It's not a requirement; there are other ways to deploy. Large
clusters tend to use cluster management tooling that keeps the OS
images consistent, or you can use more devops-centric tooling
(including Apache Whirr) to roll things out.