Ted Dunning wrote:
I use several strategies:

A) avoid dependency on hadoop's configuration by using http access to files.
I use this, for example, where we have a PHP or grails or oracle app that
needs to read a data file or three from HDFS.

B) rsync early and often and lock down the config directory.

C) get a really good sysop who does (b) and shoots people who mess up

D) (we don't do this yet) establish a configuration repository using
zookeeper or a webdav or a (horrors) NFS file system.  At the very least, I
would like to be able to get namenode address and port.

I think in a single organisation you can get away with SVN management of conf files, if you build everything in. With an HTTP-SVN bridge you could always pull in the file over HTTP during startup.
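
A minimal sketch of that startup step, in Python for brevity. It assumes the conf file is a Hadoop-style XML configuration and that the namenode is named by the fs.default.name property as an hdfs://host:port URI. The function names, the sample hostname, and whatever URL you pass to fetch_conf are hypothetical, not anything from a real deployment.

```python
import urllib.request
import xml.etree.ElementTree as ET
from urllib.parse import urlparse


def namenode_from_conf(xml_text, prop="fs.default.name"):
    """Return (host, port) of the namenode named in a Hadoop-style conf file."""
    root = ET.fromstring(xml_text)
    for p in root.iter("property"):
        if p.findtext("name", "").strip() == prop:
            uri = urlparse(p.findtext("value", "").strip())
            return uri.hostname, uri.port
    raise KeyError(prop)


def fetch_conf(url, timeout=10):
    """Pull the conf file over HTTP at startup (the URL is site-specific)."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


# Hypothetical sample of what the repository might serve.
SAMPLE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://namenode.example.org:8020</value>
  </property>
</configuration>"""

if __name__ == "__main__":
    print(namenode_from_conf(SAMPLE))
```

A client would call namenode_from_conf(fetch_conf(url)) once at startup, so the only thing it has to be told locally is where the conf file lives.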


Mostly, our apps are in the cluster and covered by b+c or very out of the
cluster and covered by a.  Many of our apps are pure import or pure export.
The import side really only needs to know where the namenode is and the pure
export only really needs the http access.  That makes the configuration
management task vastly easier.

Another serious (as in SERIOUS) problem is how you keep data-processing
elements from a QA or staging data chain from inserting bogus data into the
production data chain, but still have them work in production with minimal
reconfiguration on final deploy.

That's a problem on all projects. One team I was on once had a classic disaster where a test cluster bonded to the production MSSQL database (remember, Windows likes flat \\database-3 style names) and caused plenty of damage. Still, it's good to test that your backup strategy works, at least in hindsight. It never seems so good at the time, though.

We don't have a particularly good solution
for that yet, but are planning on using zookeeper host-based permissions to
good effect there.  That should let us have data mirrors that shadow the
production data feed system so that staged systems can process live data,
but be unable to insert it back into the production setting.  The mirror
will have read-only access to the feed meta-data and the staging machines
will have no access to the production feed meta-data and these limitations
will be imposed by a single configuration on the zookeeper rather than on
each machine.  This should allow us to keep it cleaner than these things
normally wind up.
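
To make the intended policy concrete, here is a rough Python model of it. This is not a ZooKeeper client; it only mimics the idea of one central table of host-based permissions (ZooKeeper's "ip" ACL scheme works along these lines). The paths, the network ranges, and the assumption that the mirror writes to a separate staging feed are all hypothetical.

```python
import ipaddress

READ, WRITE = "r", "w"

# Hypothetical central policy: (path prefix, client network, permissions).
POLICY = [
    ("/feeds/production", "10.0.1.0/24", {READ, WRITE}),  # production hosts
    ("/feeds/production", "10.0.2.0/24", {READ}),         # mirror: read-only
    ("/feeds/staging", "10.0.2.0/24", {READ, WRITE}),     # mirror feeds staging
    ("/feeds/staging", "10.0.3.0/24", {READ, WRITE}),     # staging hosts
]


def allowed(host_ip, path, perm):
    """True if some policy entry grants perm on path to host_ip."""
    addr = ipaddress.ip_address(host_ip)
    for prefix, network, perms in POLICY:
        under_prefix = path == prefix or path.startswith(prefix + "/")
        if under_prefix and addr in ipaddress.ip_network(network):
            if perm in perms:
                return True
    return False  # default deny: hosts not listed get nothing


if __name__ == "__main__":
    print(allowed("10.0.2.7", "/feeds/production/meta", READ))
    print(allowed("10.0.3.4", "/feeds/production/meta", READ))
```

The point of the model is that the staging hosts never appear in any production entry, so nothing on those machines has to be configured correctly for them to be kept out.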

But the short answer is that this is a hard problem to get really, really
right.

I agree. I think right now clients need a bit too much info about the name node; its URL should be all they need, and presumably who to log in as. In a local cluster you can use discovery services to get a list of machines offering the service too, though it's that kind of automatic binding that leads to the erased database problem I've hit before.

One thing that might be useful over time would be more client-side diagnostics. If you type ant -diagnostics (or run the <diagnostics> task), Ant runs through a set of health checks for things that have caused problems in the past:
 -mixed up JAR versions
 -tmp dir unwriteable
 -tmp dir on a filesystem with a different clock/TZ from the local machines
 -bad proxy settings
 -wrong XML parser

Some of these it can test; others it just prints. What it prints out is enough for a support email or bug report.

For Hadoop, something that prints out the config and looks for known problems, maybe even does an nslookup on the hosts, checks the ports are open, etc., would be nice, especially if it provides hints when things are screwed up.
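
As a rough illustration of the kind of checks I mean, here is a small Python sketch. It is not an existing Hadoop tool; the function names, the hint text, and the example host and port are all made up for the example.

```python
import socket
import tempfile


def check_host(host):
    """Resolve a hostname; return its IPv4 address, or None if lookup fails."""
    try:
        return socket.gethostbyname(host)
    except OSError:
        return None


def check_port(host, port, timeout=3.0):
    """True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def check_tmp_writable(directory=None):
    """True if a temporary file can be created in directory (default tmp dir)."""
    try:
        with tempfile.TemporaryFile(dir=directory):
            return True
    except OSError:
        return False


def diagnose(host, port):
    """Run the checks and return a list of (check, ok, hint) tuples."""
    results = []
    addr = check_host(host)
    results.append(("resolve " + host, addr is not None,
                    "" if addr else "check DNS or the hosts file"))
    if addr is not None:
        ok = check_port(host, port)
        results.append(("connect %s:%d" % (host, port), ok,
                        "" if ok else "is the service up, or a firewall in the way?"))
    ok = check_tmp_writable()
    results.append(("tmp dir writable", ok, "" if ok else "check tmp dir permissions"))
    return results


if __name__ == "__main__":
    # Hypothetical namenode address; substitute whatever the config says.
    for check, ok, hint in diagnose("namenode.example.org", 8020):
        print("%-40s %s %s" % (check, "ok" if ok else "FAILED", hint))
```

The output is deliberately plain text, so it can be pasted straight into a support email.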

-Steve

--
Steve Loughran                  http://www.1060.org/blogxter/publish/5
Author: Ant in Action           http://antbook.org/
