Amandeep Khurana wrote:
Apparently, the attached file was stripped off. Here's the link where you
can get it:
http://www.soe.ucsc.edu/~akhurana/Hadoop_Security.pdf
Amandeep
This is a good paper with test data to go alongside the theory.
Introduction
========
-I'd cite NFS as a good equivalent design, the same "we trust you to be
who you say you are" protocol, similar assumptions about the network
("only trusted machines get on it")
-If EC2 does not meet these requirements, you could argue it's the fault
of EC2; there's no fundamental reason why it can't offer private VPNs for
clusters the way other infrastructure (VMWare) can
-the whoami call is done by the command line client; different clients
don't even have to do that. Mine doesn't.
-it is not the "superuser" in the Unix sense, "root", that runs jobs; it is
whichever user started hadoop on that node. That can still be a locked
down user with limited machine rights.
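To make the trust model concrete: the server accepts whatever username the client chooses to send. A minimal sketch of that, with a hypothetical `ClientIdentity` class that is not Hadoop's actual API:

```java
// Sketch of the "we trust you to be who you say you are" model:
// identity is whatever the client claims. ClientIdentity is a
// hypothetical name for illustration, not Hadoop code.
public class ClientIdentity {
    private final String user;

    private ClientIdentity(String user) { this.user = user; }

    // What the stock command-line client effectively does: ask the OS.
    public static ClientIdentity fromWhoami() {
        return new ClientIdentity(System.getProperty("user.name"));
    }

    // What any other client is free to do: claim an arbitrary name.
    public static ClientIdentity claimed(String user) {
        return new ClientIdentity(user);
    }

    public String getUser() { return user; }
}
```

Nothing on the server side distinguishes the two constructors, which is the whole problem.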
Attacks
====
Add
-unauthorised nodes spoofing other IP addresses (via ARP attacks) and
becoming nodes in the cluster. You could acquire and then keep or
destroy data, or pretend to do work and return false values. Or come up
as a spoof namenode or datanode and disrupt all work.
-denial of service attacks: too many heartbeats, etc
-spoof clients running malicious code on the tasktrackers.
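For the heartbeat-flood case, one mitigation is a per-node rate limit at the namenode/jobtracker. A sketch under assumed names and limits (fixed-window counting, nothing Hadoop-specific):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative per-node heartbeat rate limiter for the DoS case above:
// fixed time window, at most maxPerWindow heartbeats per node per window.
// Names and limits are assumptions, not Hadoop code.
public class HeartbeatLimiter {
    private final int maxPerWindow;
    private final long windowMillis;
    // node id -> { window start time, count in window }
    private final Map<String, long[]> state = new HashMap<>();

    public HeartbeatLimiter(int maxPerWindow, long windowMillis) {
        this.maxPerWindow = maxPerWindow;
        this.windowMillis = windowMillis;
    }

    // Returns true if the heartbeat should be processed, false if dropped.
    public synchronized boolean allow(String nodeId, long now) {
        long[] s = state.get(nodeId);
        if (s == null || now - s[0] >= windowMillis) {
            state.put(nodeId, new long[]{now, 1}); // new window
            return true;
        }
        if (s[1] < maxPerWindow) {
            s[1]++;
            return true;
        }
        return false; // over the limit: drop and (ideally) log it
    }
}
```

Dropped heartbeats from a single node are also a useful signal to log for the attack-detection point below.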
Protocol
======
-SSL does need to deal with trust; unless you want to pay for every
server certificate (you may be able to share them), you'll
need to set up your own CA and issue private certs, leaving you with
the problem of securely distributing the CA public keys and getting SSL
private keys out to nodes securely (and stopping anything on the net
from using your kickstart server to boot a VM with the same MAC address
as a trusted server just to get at those keys)
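On the private-CA side, the client end is straightforward with stock JSSE APIs: pin trust to a truststore containing only your CA. A sketch; the truststore here is built empty for illustration, and in a real deployment you would load your CA certificate into it:

```java
import java.security.KeyStore;
import javax.net.ssl.SSLContext;
import javax.net.ssl.TrustManagerFactory;

// Sketch of pinning TLS trust to a private CA using standard JSSE.
// The empty truststore is a placeholder; a real one would hold the
// cluster's private CA certificate.
public class PrivateCaTrust {

    // Build an SSLContext that trusts only what is in the given store.
    public static SSLContext contextTrusting(KeyStore trustStore) throws Exception {
        TrustManagerFactory tmf =
            TrustManagerFactory.getInstance(TrustManagerFactory.getDefaultAlgorithm());
        tmf.init(trustStore);
        SSLContext ctx = SSLContext.getInstance("TLS");
        ctx.init(null, tmf.getTrustManagers(), null);
        return ctx;
    }

    public static KeyStore emptyTrustStore() throws Exception {
        KeyStore ks = KeyStore.getInstance(KeyStore.getDefaultType());
        ks.load(null, null); // empty store
        // In practice: ks.setCertificateEntry("hadoop-ca", caCertificate);
        return ks;
    }
}
```

The hard part remains distributing that CA cert and the nodes' private keys securely, which no amount of client-side code fixes.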
-I'll have to get somebody who understands security protocols to review
the paper. One area I'd flag as trouble is that on virtual machines,
clock drift can be choppy and non-linear. You also have to worry about
clients not being in the right time zone. It is good for everything to
work off one clock (say the namenode) rather than their own. Amazon's S3
authentication protocol has this bug, as do the bits of WS-DM which take
absolute times rather than relative ones (presumably to make operations
idempotent). At the very least, the namenode needs an operation to return
its current time, which callers can then work off.
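That "work off the namenode's clock" idea reduces to the client keeping an offset from one such time query, rather than trusting its own wall clock. A sketch with illustrative method names (the midpoint correction assumes the reply was generated roughly mid-flight):

```java
// Sketch: a client queries the namenode's current time and keeps a
// local offset, so all subsequent timestamps are on the server's clock.
// Method names are illustrative, not an existing Hadoop API.
public class ClockOffset {

    // t0/t1: local clock before/after the RPC; serverTime: the reply.
    // Assumes the server generated its reply roughly mid-flight.
    public static long estimateOffset(long t0, long serverTime, long t1) {
        long midpoint = t0 + (t1 - t0) / 2;
        return serverTime - midpoint;
    }

    // Local time corrected onto the server's clock.
    public static long serverNow(long localNow, long offset) {
        return localNow + offset;
    }
}
```

This sidesteps both VM clock drift and wrong client time zones, since only the one offset matters.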
Implementation
==========
-any implementation should be allowed to use a different (userid,
credentials) pair than (whoami, ~/.hadoop). This is to allow workflow
servers and the like to schedule work as different users.
-server side should log successes/failures to different log categories;
with that and JMX instrumentation you can track security attacks.
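A minimal sketch of that server-side logging, assuming java.util.logging and illustrative category names; the counters are the shape a standard `SecurityAuditMBean` interface would expose over JMX:

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.logging.Logger;

// Sketch: separate log categories for auth successes and failures,
// plus counters suitable for exposing over JMX as a standard MBean.
// Category names are assumptions, not existing Hadoop loggers.
public class SecurityAudit {
    private static final Logger SUCCESS = Logger.getLogger("SecurityLogger.success");
    private static final Logger FAILURE = Logger.getLogger("SecurityLogger.failure");

    private final AtomicLong successes = new AtomicLong();
    private final AtomicLong failures = new AtomicLong();

    public void recordSuccess(String user) {
        successes.incrementAndGet();
        SUCCESS.info("auth success: " + user);
    }

    public void recordFailure(String user) {
        failures.incrementAndGet();
        FAILURE.warning("auth failure: " + user);
    }

    // Getters in the style a SecurityAuditMBean interface would declare,
    // so a monitoring tool can watch failure spikes for attack patterns.
    public long getSuccessCount() { return successes.get(); }
    public long getFailureCount() { return failures.get(); }
}
```

Separate categories mean operators can route failures to an alerting appender while keeping successes at a quieter level.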
Overall, a nice paper. Do you have the patches to try it out on a bigger
cluster?