(adding back the list)
On Thu, Dec 5, 2013 at 10:02 AM, Guido Trotter <[email protected]> wrote:
> On Wed, Dec 4, 2013 at 2:56 PM, Helga Velroyen <[email protected]> wrote:
> > This is a design doc addressing issue 377. Objective is
> > to reduce the number of nodes that are able to establish
> > ssh and RPC connections to other nodes. Limiting this
> > set of nodes to the master candidates is desired to
> > decrease the risk of a compromised node compromising the
> > entire cluster.
> >
> > Signed-off-by: Helga Velroyen <[email protected]>
> > ---
> >  doc/design-candidates.rst | 379 ++++++++++++++++++++++++++++++++++++++++++++++
> >  doc/index.rst             |   1 +
> >  2 files changed, 380 insertions(+)
> >  create mode 100644 doc/design-candidates.rst
> >
> > diff --git a/doc/design-candidates.rst b/doc/design-candidates.rst
> > new file mode 100644
> > index 0000000..1da5b2e
> > --- /dev/null
> > +++ b/doc/design-candidates.rst
> > @@ -0,0 +1,379 @@
> > +================================================
> > +Improvements regarding Master Candidate Security
> > +================================================
> > +
> > +This document describes an enhancement of Ganeti's security by restricting
> > +the distribution of security-sensitive data to the master and master
> > +candidates only.
> > +
> > +Note: In this document, we will use the term 'normal node' for a node that
> > +is neither master nor master-candidate.
> > +
> > +.. contents:: :depth: 4
> > +
> > +Objective
> > +=========
> > +
> > +Up till 2.10, Ganeti distributed security-relevant keys to all nodes,
> > +including nodes that are neither master nor master-candidates. Those
> > +keys are the private and public SSH keys for node communication and the
> > +SSL certificate and private key for RPC communication. The objective of
> > +this design is to limit the set of nodes that can establish ssh and RPC
> > +connections to the master and master candidates.
> > +
> > +As pointed out in
> > +`issue 377 <https://code.google.com/p/ganeti/issues/detail?id=377>`_, this
> > +is a security risk. Since all nodes have these keys, compromising
> > +any of those nodes would possibly give an attacker access to all other
> > +machines in the cluster. Reducing the set of nodes that are able to
> > +make ssh and RPC connections to the master and master candidates would
> > +significantly reduce the risk simply because fewer machines would be a
> > +valuable target for attackers.
> > +
>
> I would add here that bigger installations could choose to run master
> candidates only on non-vm-capable nodes, thus removing the hypervisor
> attack surface.
>

True, I added this paragraph:

+Note: For bigger installations of Ganeti, it is advisable to run master
+candidate nodes as non-vm-capable nodes. This would reduce the attack
+surface for hypervisor exploitation.
+

> > +
> > +Detailed design
> > +===============
> > +
> > +
> > +Current state and shortcomings
> > +------------------------------
> > +
> > +Currently (as of 2.10), all nodes hold the following information:
> > +
> > +- the ssh host keys (public and private)
> > +- the ssh root keys (public and private)
> > +- node daemon certificates (the SSL client certificate and its
> > +  corresponding private key)
> > +
> > +Concerning ssh, this setup contains the following security issue. Since
> > +all nodes of a cluster can ssh as root into any other cluster node, one
> > +compromised node can harm all other nodes of a cluster.
> > +
> > +Regarding the SSL encryption of the RPC communication with the node
> > +daemon, we currently have the following setup. There is only one
> > +certificate which is used as both client and server certificate. Besides
> > +the SSL client verification, we check if the used client certificate is
> > +the same as the certificate stored on the server.
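For illustration, the pre-change check described above essentially boils down to a bytewise comparison between the presented client certificate and the single certificate every node holds. A hand-wavy sketch (the function name and data are made up, not Ganeti's actual code):

```python
def accept_rpc_client(presented_cert_pem: bytes, node_cert_pem: bytes) -> bool:
    """Sketch of the current scheme: one shared certificate.

    Every node uses the same certificate as both client and server
    certificate, so after SSL client verification the server merely
    checks that the presented certificate equals its own copy.
    """
    return presented_cert_pem == node_cert_pem

# Because the certificate is identical on all nodes, ANY node
# (including a compromised one) passes this check against any node:
shared_cert = b"-----BEGIN CERTIFICATE-----\n...same on all nodes...\n"
print(accept_rpc_client(shared_cert, shared_cert))  # True
```

This makes the shortcoming concrete: possession of the one shared certificate is all that is needed to issue RPCs anywhere in the cluster.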
> > +
> > +This means that any node running a node daemon can also act as an RPC
> > +client and use it to issue RPC calls to other cluster nodes. This in
> > +turn means that any compromised node could be used to make RPC calls to
> > +any node (including itself) to gain full control over VMs. This could
> > +be used by an attacker to, for example, bring down the VMs or exploit bugs
> > +in the virtualization stacks to gain access to the host machines as well.
> > +
> > +
> > +Proposal concerning SSH key distribution
> > +----------------------------------------
> > +
> > +We propose to limit the set of nodes holding the private root user SSH key
> > +to the master and the master candidates. This way, the security risk would
> > +be limited to a rather small set of nodes even though the cluster could
> > +consist of many more nodes. The set of master candidates could be protected
> > +better than the normal nodes (for example by residing in a DMZ) to enhance
> > +security even more if the administrator wishes so. The following
> > +sections describe in detail which Ganeti commands are affected by this
> > +change and in what way.
> > +
> > +
> > +(Re-)Adding nodes to a cluster
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +According to ``design-node-add.rst``, Ganeti transfers the ssh keys to every
> > +node that gets added to the cluster.
> > +
> > +We propose to change this procedure to treat master candidates and normal
> > +nodes differently. For master candidates, the procedure would stay as is.
> > +For normal nodes, Ganeti would transfer the public and private ssh host
> > +keys (as before) and only the public root key.
> > +
> > +A normal node would not be able to connect via ssh to other nodes, but
> > +the master (and potentially master candidates) can connect to this node.
> > +
> > +In case of readding a node that used to be in the cluster before,
> > +handling of the ssh keys would basically be the same with the following
> > +additional modifications: if the node used to be a master or
> > +master-candidate node, but will be a normal node after readding, Ganeti
> > +should make sure that the private root key is deleted if it is still
> > +present on the node.
> > +
> > +
> > +Pro- and demoting a node to/from master candidate
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +If the role of a node is changed from 'normal' to 'master_candidate', the
> > +master node should at that point copy the private root ssh key. When demoting
> > +a node from master candidate to a normal node, the keys that have been copied
> > +there on promotion or addition should be removed again.
> > +
> > +This affects the behavior of the following commands:
> > +
> > +::
> > +
> > +  gnt-node modify --master-candidate=yes
> > +  gnt-node modify --master-candidate=no [--auto-promote]
> > +
> > +If the node has already been a master candidate before the command to
> > +promote it was issued, Ganeti does not do anything.
> > +
> > +Note that when you demote a node from master candidate to normal node, another
> > +master-capable and normal node will be promoted to master candidate. For this
> > +newly promoted node, the same changes apply as if it was explicitly promoted.
> > +
> > +The same behavior should be ensured for the corresponding rapi command.
> > +
> > +
> > +Offlining and onlining a node
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When offlining a node, it immediately loses its role as master or master
> > +candidate as well. When it is onlined again, it will become master
> > +candidate again if it was so before. The handling of the keys should be done
> > +in the same way as when the node is explicitly promoted or demoted to or from
> > +master candidate. See the previous section for details.
> > +
> > +This affects the commands:
> > +
> > +::
> > +
> > +  gnt-node modify --offline=yes
> > +  gnt-node modify --offline=no [--auto-promote]
> > +
> > +For offlining, the removal of the keys is particularly important, as the
> > +detection of a compromised node might be the very reason for the offlining.
> > +
> > +The same behavior should be ensured for the corresponding rapi command.
> > +
> > +
> > +Cluster verify
> > +~~~~~~~~~~~~~~
> > +
> > +To make sure the private root ssh key was not distributed to a normal
> > +node, 'gnt-cluster verify' will be extended by a check for the key
> > +on normal nodes. Additionally, it will check if the private key is
> > +indeed present on master candidates.
> > +
> > +
> > +Proposal regarding node daemon certificates
> > +-------------------------------------------
> > +
> > +Regarding the node daemon certificates, we propose the following changes
> > +in the design.
> > +
> > +- Instead of using the same certificate for all nodes as both server
> > +  and client certificate, we generate a common server certificate (and
> > +  the corresponding private key) for all nodes and a different client
> > +  certificate (and the corresponding private key) for each node.
> > +- In addition, we store a mapping of
> > +  (node UUID, client certificate digest) in the cluster's configuration
> > +  and ssconf for hosts that are master or master candidate.
> > +  The client certificate digest is a hash of the client certificate.
> > +  We suggest a 'sha1' hash here. We will call this mapping 'candidate map'
> > +  from here on.
> > +- The node daemon will be modified in a way that on an incoming RPC
> > +  request, it first performs a client verification (same as before) to
> > +  ensure that the requesting host is indeed the holder of the
> > +  corresponding private key. Additionally, it compares the digest of
> > +  the certificate of the incoming request to the respective entry of
> > +  the candidate map.
> > +  If the digest does not match the entry of the host
> > +  in the mapping or is not included in the mapping at all, the SSL
> > +  connection is refused.
> > +
> > +This design has the following advantages:
> > +
> > +- A compromised normal node cannot issue RPC calls, because it will
> > +  not be in the candidate map.
> > +- A compromised master candidate would be able to issue RPC requests,
> > +  but on detection of its compromised state, it can be removed from the
> > +  cluster (and thus from the candidate map) without the need for
> > +  redistribution of any certificates, because the other master candidates
> > +  can continue using their own certificates.
> > +- A compromised node would not be able to use the other (possibly master
> > +  candidate) nodes' information from the candidate map to issue RPCs,
> > +  because the config just stores the digests and not the certificate
> > +  itself.
> > +- A compromised node would be able to obtain another node's certificate
> > +  by waiting for incoming RPCs from this other node. However, the node
> > +  cannot use the certificate to issue RPC calls, because the SSL client
> > +  verification would require the node to hold the corresponding private
> > +  key as well.
> > +
> > +Drawbacks of this design:
> > +
> > +- Complexity of node and certificate management will be increased (see
> > +  following sections for details).
> > +- If the candidate map is not distributed fast enough to all nodes after
> > +  an update of the configuration, it might be possible to issue RPC calls
> > +  from a compromised master candidate node that has been removed
> > +  from the Ganeti cluster already. However, this is still a better
> > +  situation than before and an inherent problem when one wants to
> > +  distinguish between master candidates and normal nodes.
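To make the proposed check concrete, the candidate-map lookup the node daemon would perform after the usual SSL client verification might look roughly like this. This is a sketch only: the function and variable names are invented, and whether Ganeti hashes the PEM bytes or the DER form is an implementation detail — the design only specifies "a hash of the client certificate":

```python
import hashlib

def is_authorized(client_cert_pem, candidate_map):
    """Check an incoming client certificate against the candidate map.

    candidate_map maps node UUIDs to sha1 hex digests of the client
    certificates, as stored in the cluster configuration and ssconf.
    """
    digest = hashlib.sha1(client_cert_pem).hexdigest()
    # Refuse the connection unless the digest belongs to some master
    # candidate; a normal (or already removed) node will not appear
    # in the map at all.
    return digest in candidate_map.values()

# Hypothetical example data:
cert_of_candidate = b"-----BEGIN CERTIFICATE-----\n...candidate...\n"
cert_of_normal_node = b"-----BEGIN CERTIFICATE-----\n...normal...\n"
cmap = {"node1-uuid": hashlib.sha1(cert_of_candidate).hexdigest()}

print(is_authorized(cert_of_candidate, cmap))    # True
print(is_authorized(cert_of_normal_node, cmap))  # False
```

Note how this captures the listed advantages: the map stores only digests, so a node cannot reconstruct another node's certificate from it, and removing a compromised candidate's entry revokes its access without reissuing anyone else's certificate.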
> > +
> > +Alternative proposals:
> > +
> > +- Instead of generating a client certificate per node, one could think
> > +  of just generating two different client certificates, one for normal
> > +  nodes and one for master candidates. Noded could then just check if
> > +  the requesting node has the master candidate certificate. Drawback of
> > +  this proposal is that once one master candidate gets compromised, all
> > +  master candidates would need to get a new certificate.
>
> This is true anyway, since it would be trivial from a compromised
> master candidate to fetch all other MCs certificates, e.g. via ssh.
>

Sure, I rephrased that a little to point out that the improvement helps
only as long as the compromised node hasn't fetched anything yet:

[...] Drawback of this proposal is that once one master candidate gets
compromised, all master candidates would need to get a new certificate
even if the compromised master candidate had not yet fetched the
certificates from the other master candidates via ssh.

> > +- In addition to our main proposal, one could think of including a
> > +  piece of data (for example the node's host name or UUID) in the RPC
> > +  call which is encrypted with the requesting node's private key. The
> > +  node daemon could check if the datum can be decrypted using the node's
> > +  certificate. However, this would provide similar functionality to
> > +  SSL's built-in client verification and add significant complexity
> > +  to Ganeti's RPC protocol.
> > +
> > +In the following sections, we describe how our design affects various
> > +Ganeti operations.
> > +
> > +
> > +Cluster initialization
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +On cluster initialization, so far only the node daemon certificate was
> > +created. With our design, two certificates (and corresponding keys)
> > +need to be created, a server certificate to be distributed to all nodes
> > +and a client certificate only to be used by this particular node.
> > +In the
> > +following, we use the term node daemon certificate for the server
> > +certificate only.
> > +
> > +In the cluster configuration, the candidate map is created. It is
> > +populated with the respective entry for the master node. It is also
> > +written to ssconf.
> > +
> > +
> > +(Re-)Adding nodes
> > +~~~~~~~~~~~~~~~~~
> > +
> > +When a node is added, the server certificate is copied to the node (as
> > +before). Additionally, a new client certificate (and the corresponding
> > +private key) is created on the new node to be used only by the new node
> > +as client certificate.
> > +
> > +If the new node is a master candidate, the candidate map is extended by
> > +the new node's data. As before, the updated configuration is distributed
> > +to all nodes (as complete configuration on the master candidates and
> > +ssconf on all nodes). Note that distribution of the configuration after
> > +adding a node is already implemented, since all nodes hold the list of
> > +nodes in the cluster in ssconf anyway.
> > +
> > +If the configuration for whatever reason already holds an entry for this
> > +node, it will be overridden.
> > +
> > +When readding a node, the procedure is the same as for adding a node.
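As a rough sketch of what generating the per-node client certificate on node addition involves, something like the following openssl invocation would do it. The file names and the subject are hypothetical — Ganeti creates its certificates internally rather than by shelling out — and, as above, the exact digest form that ends up in the candidate map is an implementation detail:

```shell
# Generate a self-signed client certificate and private key for the
# new node (illustrative file names and subject).
openssl req -new -x509 -nodes -days 1825 \
  -newkey rsa:2048 \
  -subj "/CN=node1.example.com" \
  -keyout client.key -out client.pem

# A sha1 digest of the certificate is the kind of value that would go
# into the candidate map entry for this node.
openssl x509 -in client.pem -noout -fingerprint -sha1
```

The server certificate, by contrast, is generated once and merely copied to each node, exactly as the single node daemon certificate is today.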
> > +
> > +Promotion and demotion of master candidates
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a normal node gets promoted to be master candidate, an entry to the
> > +candidate map has to be added and the updated configuration has to be
> > +distributed to all nodes. If there was already an entry for the node,
> > +we override it.
> > +
> > +On demotion of a master candidate, the node's entry in the candidate map
> > +gets removed and the updated configuration gets redistributed.
> > +
> > +The same procedure applies to onlining and offlining master candidates.
> > +
> > +
> > +Cluster verify
> > +~~~~~~~~~~~~~~
> > +
> > +Cluster verify will be extended by the following checks:
> > +
> > +- Whether each entry in the candidate map indeed corresponds to a master
> > +  candidate.
> > +- Whether the master candidates' certificate digests match their entry
> > +  in the candidate map.
> > +
> > +
> > +Crypto renewal
> > +~~~~~~~~~~~~~~
> > +
> > +Currently, when the cluster's cryptographic tokens are renewed using the
> > +``gnt-cluster renew-crypto`` command, the node daemon certificate is
> > +renewed (among others). Option ``--new-cluster-certificate`` renews the
> > +node daemon certificate only.
> > +
> > +In addition to the renewal of the node daemon server certificate, we
> > +propose to renew all client certificates when ``gnt-cluster
> > +renew-crypto`` is called without another option.
> > +
> > +By adding an option ``--new-node-certificates`` we offer to renew the
> > +client certificates only. Whenever the client certificates are renewed, the
> > +candidate map has to be updated and redistributed.
> > +
> > +If for whatever reason there is an entry in the candidate map of a node
> > +that is not a master candidate (for example due to inconsistent updating
> > +after a demotion or offlining), we allow the user to remove the entry
> > +from the candidate list (for example if cluster verify detects this
> > +inconsistency).
> > +We propose to implement a new option called
> > +
> > +::
> > +
> > +  gnt-cluster renew-crypto --update-candidate-map
> > +
> > +TODO: describe what exactly should happen here
> > +
> > +
> > +Further considerations
> > +----------------------
> > +
> > +Watcher
> > +~~~~~~~
> > +
> > +The watcher is a script that is run on all nodes at regular intervals. The
> > +changes proposed in this design will not affect the watcher's implementation,
> > +because it behaves differently on the master than on non-master nodes.
> > +
> > +Only on the master, it issues query calls which would require a client
> > +certificate of a node in the candidate mapping. This is the case for the
> > +master node. On non-master nodes, its only external communication is done via
> > +the ConfD protocol, which uses the hmac key, which is present on all nodes.
> > +Besides that, the watcher does not make any ssh connections, and thus is
> > +not affected by the changes in ssh key handling either.
> > +
> > +
> > +Other Keys
> > +~~~~~~~~~~
> > +
> > +Ganeti handles a couple of other keys/certificates that have not been mentioned
> > +in this design so far. They will not be affected by this design for several
> > +reasons:
> > +
> > +- The hmac key used by ConfD (see ``design-2.1.rst``): the hmac key is still
> > +  distributed to all nodes, because it was designed to be used for
> > +  communicating with ConfD, which should be possible from all nodes.
> > +  For example, the monitoring daemon which runs on all nodes uses it to
> > +  retrieve information from ConfD. However, since communication with ConfD
> > +  is read-only, a compromised node holding the hmac key does not enable an
> > +  attacker to change the cluster's state.
> > +
> > +  (TODO: what about WConfD?)
> > +
> > +- The rapi SSL key certificate and rapi user/password file 'rapi_users' are
> > +  already copied only to the master candidates (see ``design-2.1.rst``,
> > +  Section ``Redistribute Config``).
> > +
> > +- The spice certificates are still distributed to all nodes, since it should
> > +  be possible to use spice to access VMs on any cluster node.
> > +
> > +- The cluster domain secret is used for inter-cluster instance moves.
> > +  Since instances can be moved from any normal node of the source cluster to
> > +  any normal node of the destination cluster, the presence of this
> > +  secret on all nodes is necessary.
> > +
> > +
> > +Related and Future Work
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Ganeti RPC calls are currently done without server verification.
> > +Establishing server verification might be a desirable feature, but is
> > +not part of this design.
> > +
> > +.. vim: set textwidth=72 :
> > +.. Local Variables:
> > +.. mode: rst
> > +.. fill-column: 72
> > +.. End:
> > diff --git a/doc/index.rst b/doc/index.rst
> > index 7ec8162..e1dad68 100644
> > --- a/doc/index.rst
> > +++ b/doc/index.rst
> > @@ -110,6 +110,7 @@ Draft designs
> >    cluster-merge.rst
> >    design-autorepair.rst
> >    design-bulk-create.rst
> > +  design-candidates.rst
> >    design-chained-jobs.rst
> >    design-cmdlib-unittests.rst
> >    design-cpu-pinning.rst
> > --
> > 1.8.4.1
> >
>
> LGTM (with the interdiffs). Note that if it would be simpler to just
> have two certs instead of one client cert per node I wouldn't mind
> that approach either.
>

Thanks, I'll leave the decision open for now and will decide once I have
a better overview of the code and can better estimate the effort it
requires.

Cheers,
Helga
