Amandeep Khurana wrote:
Apparently, the attached file was stripped off. Here's the link where you
can get it:
http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
Amandeep
This is a good paper with test data to go alongside the theory.
Introduction
========
-I'd cite NFS as a good equivalent design, the same "we trust you to be
who you say you are" protocol, similar assumptions about the network
("only trusted machines get on it")
-If EC2 does not meet these requirements, you could argue it's the fault
of EC2; there's no fundamental reason why it can't offer private VPNs for
clusters the way other infrastructure (VMWare) can
-the whoami call is done by the command line client; different clients
don't even have to do that. Mine doesn't.
-it is not the "superuser" in the Unix sense, "root", that runs jobs; it is
whichever user started hadoop on that node. That can still be a locked
down user with limited machine rights.
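To make the trust model concrete: the server accepts whatever username the client chooses to send. A minimal sketch of that, with a hypothetical `ClientIdentity` class that is not Hadoop's actual API:

```java
// Sketch of the "we trust you to be who you say you are" model:
// identity is whatever the client claims. ClientIdentity is a
// hypothetical name for illustration, not Hadoop code.
public class ClientIdentity {
    private final String user;

    private ClientIdentity(String user) { this.user = user; }

    // What the stock command-line client effectively does: ask the OS.
    public static ClientIdentity fromWhoami() {
        return new ClientIdentity(System.getProperty("user.name"));
    }

    // What any other client is free to do: claim an arbitrary name.
    public static ClientIdentity claimed(String user) {
        return new ClientIdentity(user);
    }

    public String getUser() { return user; }
}
```

Nothing on the server side distinguishes the two constructors, which is the whole problem.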
Attacks
====
Add
-unauthorised nodes spoofing other IP addresses (via ARP attacks) and
becoming nodes in the cluster. You could acquire and then keep or
destroy data, or pretend to do work and return false values. Or come up
as a spoof namenode or datanode and disrupt all work.
-denial of service attacks: too many heartbeats, etc
-spoof clients running malicious code on the tasktrackers.
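For the heartbeat-flood case, one mitigation is a per-node rate limit at the namenode/jobtracker. A sketch under assumed names and limits (fixed-window counting, nothing Hadoop-specific):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-node heartbeat rate limiter for the DoS case above:
// fixed time window, at most maxPerWindow heartbeats per node per window.
// Names and limits are assumptions, not Hadoop code.
public class HeartbeatLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    // node id -> { window start time, count in window }
    private final Map<String, long[]> state = new HashMap<>();

    public HeartbeatLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the heartbeat should be processed, false if dropped.
    public synchronized boolean allow(String nodeId, long now) {
        long[] s = state.get(nodeId);
        if (s == null || now - s[0] >= windowMillis) {
            state.put(nodeId, new long[]{now, 1}); // new window
            return true;
        }
        if (s[1] < maxPerWindow) {
            s[1]++;
            return true;
        }
        return false; // over the limit: drop and (ideally) log it
    }
}
```

Dropped heartbeats from a single node are also a useful signal to log for the attack-detection point below.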
Protocol
======
-SSL does need to deal with trust; unless you want to pay for every
server certificate (you may be able to share them), you'll
need to set up your own CA and issue private certs, leaving you with
the problem of securely distributing the CA public keys and getting SSL
private keys out to nodes securely (and stopping anything on the net
from using your kickstart server to boot a VM with the same MAC address
as a trusted server just to get at those keys)
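On the private-CA side, the client end is straightforward with stock JSSE APIs: pin trust to a truststore containing only your CA. A sketch; the truststore here is built empty for illustration, and in a real deployment you would load your CA certificate into it:

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Sketch of pinning TLS trust to a private CA using standard JSSE.
// The empty truststore is a placeholder; a real one would hold the
// cluster's private CA certificate.
public class PrivateCaTrust {

    // Build an SSLContext that trusts only what is in the given store.
    public static SSLContext contextTrusting(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }

    public static KeyStore emptyTrustStore() throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // empty store
        // In practice: ks.setCertificateEntry("hadoop-ca", caCertificate);
        return ks;
    }
}
```

The hard part remains distributing that CA cert and the nodes' private keys securely, which no amount of client-side code fixes.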
-I'll have to get somebody who understands security protocols to review
the paper. One area I'd flag as trouble is that on virtual machines,
clock drift can be choppy and non-linear. You also have to worry about
clients not being in the right time zone. It is good for everything to
work off one clock (say the namenode) rather than their own. Amazon's S3
authentication protocol has this bug, as do the bits of WS-DM which take
absolute times rather than relative ones (presumably to make operations
idempotent). At the very least, the namenode needs an operation to return
its current time, which callers can then work off.
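That "work off the namenode's clock" idea reduces to the client keeping an offset from one such time query, rather than trusting its own wall clock. A sketch with illustrative method names (the midpoint correction assumes the reply was generated roughly mid-flight):

```java
// Sketch: a client queries the namenode's current time and keeps a
// local offset, so all subsequent timestamps are on the server's clock.
// Method names are illustrative, not an existing Hadoop API.
public class ClockOffset {

    // t0/t1: local clock before/after the RPC; serverTime: the reply.
    // Assumes the server generated its reply roughly mid-flight.
    public static long estimateOffset(long t0, long serverTime, long t1) {
        long midpoint = t0 + (t1 - t0) / 2;
        return serverTime - midpoint;
    }

    // Local time corrected onto the server's clock.
    public static long serverNow(long localNow, long offset) {
        return localNow + offset;
    }
}
```

This sidesteps both VM clock drift and wrong client time zones, since only the one offset matters.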
Implementation
==========
-any implementation should be allowed to use a different (userid,
credentials) pair than (whoami, ~/.hadoop). This is to allow workflow
servers and the like to schedule work as different users.
-server side should log successes/failures to different log categories;
with that and JMX instrumentation you can track security attacks.
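A minimal sketch of that server-side logging, assuming java.util.logging and illustrative category names; the counters are the shape a standard `SecurityAuditMBean` interface would expose over JMX:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Sketch: separate log categories for auth successes and failures,
// plus counters suitable for exposing over JMX as a standard MBean.
// Category names are assumptions, not existing Hadoop loggers.
public class SecurityAudit {
    private static final Logger SUCCESS = Logger.getLogger("SecurityLogger.success");
    private static final Logger FAILURE = Logger.getLogger("SecurityLogger.failure");

    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    public void recordSuccess(String user) {
        successes.incrementAndGet();
        SUCCESS.info("auth success: " + user);
    }

    public void recordFailure(String user) {
        failures.incrementAndGet();
        FAILURE.warning("auth failure: " + user);
    }

    // Getters in the style a SecurityAuditMBean interface would declare,
    // so a monitoring tool can watch failure spikes for attack patterns.
    public long getSuccessCount() { return successes.get(); }
    public long getFailureCount() { return failures.get(); }
}
```

Separate categories mean operators can route failures to an alerting appender while keeping successes at a quieter level.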
Overall, a nice paper. Do you have the patches to try it out on a bigger
cluster?