Re: How do people keep their client configurations in sync with the remote cluster(s)

Steve Loughran Fri, 16 May 2008 02:04:41 -0700

Ted Dunning wrote:

I use several strategies:


A) avoid dependency on hadoop's configuration by using http access to files.
I use this, for example, where we have a PHP or grails or oracle app that
needs to read a data file or three from HDFS.

B) rsync early and often and lock down the config directory.

C) get a really good sysop who does (b) and shoots people who mess up

D) (we don't do this yet) establish a configuration repository using
zookeeper or a webdav or a (horrors) NFS file system.  At the very least, I
would like to be able to get namenode address and port.

I think in a single organisation, you can get away with SVN management of conf files - if you build everything in. With an HTTP-svn bridge you could always pull in the file over http during startup.
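A minimal sketch of that "pull the file over http during startup" idea, in Python for brevity. It assumes the usual Hadoop <configuration>/<property>/<name>/<value> layout; the function names, the sample hostname and any URL you would pass to fetch_conf() are hypothetical, not part of Hadoop or of anyone's actual setup.

```python
import urllib.request
import xml.etree.ElementTree as ET


def parse_hadoop_conf(xml_text):
    """Return a dict of name -> value from Hadoop-style <configuration> XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def fetch_conf(url, timeout=10):
    """Download a config file over HTTP (e.g. from an HTTP-svn bridge) and parse it."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_hadoop_conf(resp.read().decode("utf-8"))


# Stand-in for what the server would return; the hostname is made up.
SAMPLE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.org:9000</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    conf = parse_hadoop_conf(SAMPLE)
    print(conf["fs.default.name"])
```

At startup a client would call fetch_conf() with the URL of the checked-in file and use the result instead of a local copy, so the only thing baked into the client is that one URL.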


Mostly, our apps are in the cluster and covered by b+c or very out of the
cluster and covered by a.  Many of our apps are pure import or pure export.
The import side really only needs to know where the namenode is and the pure
export only really needs the http access.  That makes the configuration
management task vastly easier.

Another serious (as in SERIOUS) problem is how you keep data-processing
elements from a QA or staging data chain from inserting bogus data into the
production data chain, but still have them work in production with minimal
reconfiguration on final deploy.

That's a problem on all projects. One team I was on once had a classic disaster where a test cluster bonded to the production MSSQL database (remember, Windows likes flat \\database-3 style names) and caused plenty of damage. Still, it's good to test that your backup strategy works - at least in hindsight. It never seems so good at the time, though.

We don't have a particularly good solution
for that yet, but are planning on using zookeeper host based permissions to
good effect there.  That should let us have data mirrors that shadow the
production data feed system so that staged systems can process live data,
but be unable to insert it back into the production setting.  The mirror
will have read-only access to the feed meta-data and the staging machines
will have no access to the production feed meta-data and these limitations
will be imposed by a single configuration on the zookeeper rather than on
each machine.  This should allow us to keep it cleaner than these things
normally wind up.

But the short answer is that this is a hard problem to get really, really
right.

I agree. I think right now clients need a bit too much info about the name node; its URL should be all they need, and presumably who to log in as. In a local cluster you can use discovery services to get a list of machines offering the service too, though it's that kind of automatic binding that leads to the erased database problem I've hit before.

One thing that might be useful over time would be more client-side diagnostics. If you type ant -diagnostics (or run <diagnostics>), Ant runs through a set of health checks for things that have caused problems in the past:

 -mixed up JAR versions
 -tmp dir unwriteable
 -tmp dir on a filesystem with a different clock/TZ from the local machines
 -bad proxy settings
 -wrong XML parser

Some of these it can test, some it just prints. What it prints out is enough for a support email/bugrep.

For Hadoop, something that prints out the config and looks for known problems, maybe even does nslookup() on the hosts, checks the ports are open, etc, would be nice, especially if it provides hints when things are screwed up.
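Roughly what I have in mind, as a sketch only and in Python rather than Java to keep it short. Nothing here is an existing Hadoop tool: extract_endpoint(), check_endpoint() and diagnose() are made-up names, and it assumes endpoints appear in the config as either hdfs://host:port or plain host:port values.

```python
import socket
from urllib.parse import urlparse


def extract_endpoint(value):
    """Pull (host, port) out of 'scheme://host:port' or 'host:port'; else None."""
    parsed = urlparse(value if "://" in value else "//" + value)
    try:
        port = parsed.port  # raises ValueError on a non-numeric port
    except ValueError:
        return None
    if parsed.hostname and port:
        return parsed.hostname, port
    return None


def check_endpoint(host, port, timeout=3.0):
    """Resolve the host, then try a TCP connect; return findings with hints."""
    try:
        socket.getaddrinfo(host, port)
    except socket.gaierror as e:
        return [f"{host}: DNS lookup failed ({e}); check the hostname or resolver"]
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return [f"{host}:{port}: reachable"]
    except OSError as e:
        return [f"{host}:{port}: resolves but cannot connect ({e}); "
                "is the service up, or is a firewall in the way?"]


def diagnose(conf):
    """Print-style report: every setting, plus checks on anything host:port-like."""
    report = []
    for name, value in sorted(conf.items()):
        report.append(f"{name} = {value}")
        endpoint = extract_endpoint(value)
        if endpoint:
            report.extend("  " + line for line in check_endpoint(*endpoint))
    return report


if __name__ == "__main__":
    # Hypothetical client settings; replace with the parsed config file.
    for line in diagnose({"fs.default.name": "hdfs://localhost:9000",
                          "hadoop.tmp.dir": "/tmp/hadoop"}):
        print(line)
```

The output is plain text on purpose, so it can be pasted straight into a support email.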


-Steve

--
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
