Amandeep Khurana wrote:
Thanks for the feedback Steve.
My responses to the points you mentioned are inline below.
Amandeep
Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz
On Thu, Mar 19, 2009 at 4:31 AM, Steve Loughran <ste...@apache.org> wrote:
Amandeep Khurana wrote:
Apparently, the attached file was stripped off. Here's the link where you
can get it:
http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
Amandeep
This is a good paper, with test data to go alongside the theory.
Introduction
========
-I'd cite NFS as a good equivalent design: the same "we trust you to be who
you say you are" protocol, and similar assumptions about the network ("only
trusted machines get on it")
-If EC2 does not meet these requirements, you could argue it's the fault of
EC2; there's no fundamental reason why it can't offer private VPNs for
clusters the way other infrastructure (VMware) can
-the whoami call is done by the command-line client; different clients
don't even have to do that. Mine doesn't.
-it is not the "superuser" in the Unix sense, "root", that runs jobs; it is
whichever user started hadoop on that node. That can still be a locked-down
user with limited machine rights.
I'll look into the NFS security stuff in detail and then add it later.
The key point about NFS security is that there was none, because in the
early eighties the idea of a Linux laptop getting onto your wifi network
was inconceivable, so you really could trust workstations. It was only
with PC-NFS that the assumptions started to fail.
Where did EC2 come into the picture?
It's an example of a place where Hadoop is deployed and where the
assumptions that only trusted users have network access (and/or that only
fixed IP addresses can join the cluster) don't hold.
Yes, the whoami call can be bypassed; that's why the whole design revolves
around authentication.
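To spell out just how weak that is: the username is only a string the
client chooses to put on the wire. A minimal sketch of the idea (a
hypothetical class, not the actual Hadoop client code):

    // Hypothetical sketch -- not the real Hadoop RPC classes. The point
    // is that identity is client-asserted: the server has no way to tell
    // these two apart.
    public class ClientIdentity {

        // What the stock command-line client effectively does:
        static String defaultUser() {
            return System.getProperty("user.name"); // same answer as whoami
        }

        // What any other client is free to send instead:
        static String spoofedUser() {
            return "cluster-admin"; // an arbitrary, unverified claim
        }

        public static void main(String[] args) {
            System.out.println("claimed: " + defaultUser());
            System.out.println("equally acceptable to the server: " + spoofedUser());
        }
    }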
By superuser, I meant the user who starts the hadoop instance... I'll make
it clearer in the writing.
OK
Attacks
====
Add:
-unauthorised nodes spoofing other IP addresses (via ARP attacks) and
becoming nodes in the cluster. You could acquire and then keep or destroy
data, or pretend to do work and return false values. Or come up as a
spoofed namenode or datanode and disrupt all work.
-denial of service attacks: too many heartbeats, etc
-spoofed clients running malicious code on the tasktrackers.
I haven't looked at these attacks; this paper is not focusing on them. They
can definitely be looked at and incorporated at a later stage. Let's go
step by step. (Debatable)
I was just broadening the list of attacks. A spoofed node joining the
cluster is something to fear.
Protocol
======
-SSL does need to deal with trust; unless you want to pay for every server
certificate (you may be able to share them), you'll need to set up your own
CA and issue private certs, leaving you with the problem of securely
distributing the CA public keys and getting SSL private keys out to the
nodes (and stopping anything on the net from using your kickstart server to
boot a VM with the same MAC address as a trusted server just to get at
those keys)
SSL is a possible solution, but the details aren't the focus of this
design. Regarding the other keys, there is a format around which they are
created, and you don't need a CA for that.
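For reference, the client side of trusting a self-run CA is the small half;
a rough Java sketch (the truststore path and password are placeholders):

    import java.io.FileInputStream;
    import java.security.KeyStore;
    import javax.net.ssl.SSLContext;
    import javax.net.ssl.TrustManagerFactory;

    // Rough sketch of a client trusting a private, self-run CA. The
    // truststore "cluster-ca.jks" would contain only the CA's public
    // certificate; the path and password here are placeholders.
    public class PrivateCaTrust {
        public static SSLContext trustPrivateCa() throws Exception {
            KeyStore trustStore = KeyStore.getInstance("JKS");
            FileInputStream in = new FileInputStream("cluster-ca.jks");
            try {
                trustStore.load(in, "changeit".toCharArray());
            } finally {
                in.close();
            }
            TrustManagerFactory tmf = TrustManagerFactory
                    .getInstance(TrustManagerFactory.getDefaultAlgorithm());
            tmf.init(trustStore);
            SSLContext ctx = SSLContext.getInstance("TLS");
            ctx.init(null, tmf.getTrustManagers(), null); // no client keys needed here
            return ctx;
        }
    }

Getting the private keys out to the servers securely is the hard half, as
you say.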
-I'll have to get somebody who understands security protocols to review the
paper. One area I'd flag as trouble is that on virtual machines, clock drift
can be choppy and non-linear. You also have to worry about clients not being
in the right time zone. It is better for everything to work off one clock
(say the namenode's) rather than their own. Amazon's S3 authentication
protocol has this bug, as do the bits of WS-DM which take absolute times
rather than relative ones (presumably to make operations idempotent). At the
very least, the namenode needs an operation to return its current time,
which callers can then work off.
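Something like this would do; just a sketch, not an existing Hadoop
interface:

    interface ClockSource {
        long getCurrentTimeMillis(); // the namenode's clock, epoch millis
    }

    // Sketch: callers sample the namenode's clock once, then work off a
    // local estimate of namenode time instead of trusting the OS clock.
    // Network latency is ignored here; half the round-trip time could be
    // added for a better estimate.
    public class NamenodeClock {
        private final long offsetMillis;

        public NamenodeClock(ClockSource namenode) {
            this.offsetMillis =
                namenode.getCurrentTimeMillis() - System.currentTimeMillis();
        }

        // Local estimate of the namenode's current time.
        public long now() {
            return System.currentTimeMillis() + offsetMillis;
        }
    }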
The time issue is definitely a concern and has to be cracked somehow. The
namenode giving out its time is a good idea, but the sync would still be
important. There is a way to sync the time across the cluster; I don't
remember it clearly, but I have it on my "little" cluster. I'll look that
up.
NTP is the normal protocol; everyone tries to use it. But asking the NN
for its clock would avoid having to rely on everything being in sync at
the OS level, and would let the client detect when its clock had drifted
too far off for a conversation. One recurrent problem of mine is
machines that are on NTP but whose time zones are wrong; they are
perfectly accurate to the second but 8 hours out.
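Detecting the drift client-side is then trivial; again just a sketch, with
an arbitrary tolerance:

    // Sketch: reject a conversation when the local clock is too far from
    // the namenode's. Comparing epoch millis sidesteps the time-zone case:
    // an NTP-synced box that is "8 hours out" only breaks protocols that
    // exchange absolute local times.
    public class SkewCheck {
        private static final long MAX_SKEW_MILLIS = 30000; // arbitrary placeholder

        public static boolean clockUsable(long namenodeTimeMillis) {
            long skew = Math.abs(namenodeTimeMillis - System.currentTimeMillis());
            return skew <= MAX_SKEW_MILLIS;
        }
    }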
-steve