On 18/10/11 10:48, Petru Dimulescu wrote:
> Hello,

> I wonder how you guys see the problem of automatic node discovery:
> having, for instance, a couple of Hadoop instances, with no configuration
> explicitly set whatsoever, simply discover each other and work together,
> like GridGain does: just fire up two instances of the product, on the
> same machine or on different machines on the same LAN, and they will use
> multicast or whatever to discover each other

You can use techniques like Bonjour to have Hadoop services register themselves in DNS and be located that way, but things only need to discover the NN and JT and report in.
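
Roughly, that mDNS/Bonjour route could look like the sketch below, using the JmDNS library. To be clear, the "_hadoop-nn._tcp.local." service type, the port and the NamenodeAdvertiser class are all made up for illustration; Hadoop doesn't publish anything like this today.

import java.io.IOException;
import java.net.InetAddress;
import java.util.Arrays;

import javax.jmdns.JmDNS;
import javax.jmdns.ServiceInfo;

// Sketch only: advertise a (hypothetical) namenode endpoint over mDNS
// and browse for it from another node on the same LAN. JmDNS 3.x API.
public class NamenodeAdvertiser {

  public static void main(String[] args) throws IOException {
    JmDNS jmdns = JmDNS.create(InetAddress.getLocalHost());

    // Advertise the namenode RPC port under an invented service type.
    ServiceInfo nn = ServiceInfo.create("_hadoop-nn._tcp.local.", "namenode",
        8020, "namenode rpc endpoint");
    jmdns.registerService(nn);

    // A worker browsing the same type would find the NN and report in.
    for (ServiceInfo info : jmdns.list("_hadoop-nn._tcp.local.")) {
      System.out.println(info.getName() + " at "
          + Arrays.toString(info.getHostAddresses()) + ":" + info.getPort());
    }

    jmdns.close();
  }
}

That only gets you as far as "find the NN and JT", which is all that's really needed; the workers never have to find each other.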

> and to be a part of a
> self-discovered topology.


Topology inference is an interesting problem. Something purely for diagnostics could be useful.

> Of course, if you have special network requirements you should be able
> to specify undiscoverable nodes by IP or name, but often grids are
> installed on LANs and it should really be simpler.

In a production system I'd have a private switch and isolate things for bandwidth and security; that's why auto-configuration is generally neglected. If it were to be added, it would go via ZooKeeper, leaving only the ZooKeeper discovery problem. You can't rely on DNS or multicast IP here, as they don't always work in virtualised environments.
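
If you want a feel for what the ZooKeeper route would involve, here's a rough sketch: each worker registers an ephemeral znode under a well-known path, and anything that can reach the quorum can list who is alive. The /hadoop/workers path, the quorum addresses and the class name are assumptions for the example, not anything in Hadoop today.

import java.util.List;
import java.util.concurrent.CountDownLatch;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.Watcher.Event.KeeperState;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

// Sketch only: worker self-registration via ephemeral znodes.
public class ZkWorkerRegistry {

  public static void main(String[] args) throws Exception {
    CountDownLatch connected = new CountDownLatch(1);
    ZooKeeper zk = new ZooKeeper("zk1:2181,zk2:2181,zk3:2181", 15000, event -> {
      if (event.getState() == KeeperState.SyncConnected) {
        connected.countDown();
      }
    });
    connected.await();

    // Parent nodes are persistent; any worker may create them first.
    ensure(zk, "/hadoop");
    ensure(zk, "/hadoop/workers");

    // The ephemeral node vanishes automatically when this worker's session dies.
    String me = zk.create("/hadoop/workers/worker-", "host:port".getBytes(),
        ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL);
    System.out.println("registered as " + me);

    // Masters or monitoring tools can list the live workers at any time.
    List<String> workers = zk.getChildren("/hadoop/workers", false);
    System.out.println("live workers: " + workers);

    zk.close();
  }

  private static void ensure(ZooKeeper zk, String path)
      throws KeeperException, InterruptedException {
    try {
      zk.create(path, new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
    } catch (KeeperException.NodeExistsException ignored) {
      // already created by another worker; that's fine
    }
  }
}

Of course that just moves the problem: the workers still need to be told where the ZooKeeper quorum is, which is why it only "leaves the ZooKeeper discovery problem".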


> Namenodes are a bit different, they should use safer machines; I'm
> basically talking about datanodes here, but still I wonder how hard it
> can be to have self-assigned namenodes, maybe replicated automatically on
> several machines, unless one specific namenode is explicitly set via xml
> configuration.

I wouldn't touch dynamic namenodes; you really need fixed NNs and 2NNs, and as automatic NN replication isn't there, it's a non-issue.

With fixed NN and JT entries in the DNS table, anything can come up on the LAN and talk to them unless you set up the master nodes with lists of the things you trust.


> Also, the ssh passwordless thing is so awkward. If you have a network of
> hadoop nodes that mutually discover each other there is really no need for
> this passwordless ssh requirement. This is more of a system
> administrator aspect: if sysadmins want to automatically deploy or start
> a program on 5000 machines they often have the tools and skills to do
> that; it should not be a requirement.

It's not a requirement; there are other ways to deploy. Large clusters tend to use cluster management tooling that keeps the OS images consistent, or you can use more devops-centric tooling (including Apache Whirr) to roll things out.
