(adding back the list)

On Thu, Dec 5, 2013 at 10:02 AM, Guido Trotter <[email protected]> wrote:

> On Wed, Dec 4, 2013 at 2:56 PM, Helga Velroyen <[email protected]> wrote:
> > This is a design doc addressing issue 377. Objective is
> > to reduce the number of nodes that are able to establish
> > ssh and RPC connections to other nodes. Limiting this
> > set of nodes to the master candidates is desired to
> > decrease the risk of a compromised node compromising the
> > entire cluster.
> >
> > Signed-off-by: Helga Velroyen <[email protected]>
> > ---
> >  doc/design-candidates.rst | 379
> ++++++++++++++++++++++++++++++++++++++++++++++
> >  doc/index.rst             |   1 +
> >  2 files changed, 380 insertions(+)
> >  create mode 100644 doc/design-candidates.rst
> >
> > diff --git a/doc/design-candidates.rst b/doc/design-candidates.rst
> > new file mode 100644
> > index 0000000..1da5b2e
> > --- /dev/null
> > +++ b/doc/design-candidates.rst
> > @@ -0,0 +1,379 @@
> > +================================================
> > +Improvements regarding Master Candidate Security
> > +================================================
> > +
> > +This document describes an enhancement of Ganeti's security by
> > +restricting the distribution of security-sensitive data to the master
> > +and master candidates only.
> > +
> > +Note: In this document, we will use the term 'normal node' for a
> > +node that is neither master nor master-candidate.
> > +
> > +.. contents:: :depth: 4
> > +
> > +Objective
> > +=========
> > +
> > +Up till 2.10, Ganeti distributed security-relevant keys to all nodes,
> > +including nodes that are neither master nor master-candidates. Those
> > +keys are the private and public SSH keys for node communication and the
> > +SSL certificate and private key for RPC communication. The objective
> > +of this design is to limit the set of nodes that can establish ssh and RPC
> > +connections to the master and master candidates.
> > +
> > +As pointed out in
> > +`issue 377 <https://code.google.com/p/ganeti/issues/detail?id=377>`_,
> > +this is a security risk. Since all nodes have these keys, compromising
> > +any of those nodes would possibly give an attacker access to all other
> > +machines in the cluster. Reducing the set of nodes that are able to
> > +make ssh and RPC connections to the master and master candidates would
> > +significantly reduce the risk, simply because fewer machines would be
> > +valuable targets for attackers.
> > +
>
> I would add here that bigger installations could choose to run master
> candidates only on non-vm-capable nodes, thus removing the hypervisor
> attack surface.
>

True, I added this paragraph:

+Note: For bigger installations of Ganeti, it is advisable to run master
+candidate nodes as non-vm-capable nodes. This would reduce the attack
+surface for hypervisor exploitation.
+


>
> > +
> > +Detailed design
> > +===============
> > +
> > +
> > +Current state and shortcomings
> > +------------------------------
> > +
> > +Currently (as of 2.10), all nodes hold the following information:
> > +
> > +- the ssh host keys (public and private)
> > +- the ssh root keys (public and private)
> > +- node daemon certificates (the SSL client certificate and its
> > +  corresponding private key)
> > +
> > +Concerning ssh, this setup contains the following security issue. Since
> > +all nodes of a cluster can ssh as root into any other cluster node, one
> > +compromised node can harm all other nodes of a cluster.
> > +
> > +Regarding the SSL encryption of the RPC communication with the node
> > +daemon, we currently have the following setup. There is only one
> > +certificate which is used as both client and server certificate. Besides
> > +the SSL client verification, we check if the used client certificate is
> > +the same as the certificate stored on the server.
> > +
> > +This means that any node running a node daemon can also act as an RPC
> > +client and use it to issue RPC calls to other cluster nodes. This in
> > +turn means that any compromised node could be used to make RPC calls to
> > +any node (including itself) to gain full control over VMs. This could
> > +be used by an attacker to, for example, bring down the VMs or exploit
> > +bugs in the virtualization stacks to gain access to the host machines
> > +as well.
> > +
> > +
> > +Proposal concerning SSH key distribution
> > +----------------------------------------
> > +
> > +We propose to limit the set of nodes holding the private root user
> > +SSH key to the master and the master candidates. This way, the
> > +security risk would be limited to a rather small set of nodes even
> > +though the cluster could consist of many more nodes. The set of
> > +master candidates could be protected better than the normal nodes
> > +(for example by residing in a DMZ) to enhance security even more, if
> > +the administrator wishes so. The following sections describe in
> > +detail which Ganeti commands are affected by this change and in what
> > +way.
> > +
> > +
> > +(Re-)Adding nodes to a cluster
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +According to ``design-node-add.rst``, Ganeti transfers the ssh keys
> > +to every node that gets added to the cluster.
> > +
> > +We propose to change this procedure to treat master candidates and
> > +normal nodes differently. For master candidates, the procedure would
> > +stay as is. For normal nodes, Ganeti would transfer the public and
> > +private ssh host keys (as before) and only the public root key.
> > +
> > +A normal node would not be able to connect via ssh to other nodes, but
> > +the master (and potentially master candidates) can connect to this node.
> > +
> > +In case of readding a node that used to be in the cluster before,
> > +handling of the ssh keys would basically be the same, with the
> > +following additional modification: if the node used to be a master or
> > +master-candidate node, but will be a normal node after readding, Ganeti
> > +should make sure that the private root key is deleted if it is still
> > +present on the node.
> > +
> > +
> > +Pro- and demoting a node to/from master candidate
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +If the role of a node is changed from 'normal' to 'master_candidate',
> > +the master node should at that point copy the private root ssh key
> > +to it. When demoting a node from master candidate to a normal node,
> > +the keys that have been copied there on promotion or addition should
> > +be removed again.
> > +
> > +This affects the behavior of the following commands:
> > +
> > +::
> > +
> > +  gnt-node modify --master-candidate=yes
> > +  gnt-node modify --master-candidate=no [--auto-promote]
> > +
> > +If the node was already a master candidate before the command to
> > +promote it was issued, Ganeti does not do anything.
> > +
> > +Note that when you demote a node from master candidate to normal
> > +node, another master-capable, normal node will be promoted to master
> > +candidate. For this newly promoted node, the same changes apply as if
> > +it had been explicitly promoted.
> > +
> > +The same behavior should be ensured for the corresponding rapi command.
> > +
> > +
> > +Offlining and onlining a node
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a node is offlined, it immediately loses its role as master or
> > +master candidate. When it is onlined again, it will become master
> > +candidate again if it was one before. The handling of the keys should
> > +be done in the same way as when the node is explicitly promoted or
> > +demoted to or from master candidate. See the previous section for
> > +details.
> > +
> > +This affects the commands:
> > +
> > +::
> > +
> > +  gnt-node modify --offline=yes
> > +  gnt-node modify --offline=no [--auto-promote]
> > +
> > +For offlining, the removal of the keys is particularly important, as
> > +the detection of a compromised node might be the very reason for the
> > +offlining.
> > +
> > +The same behavior should be ensured for the corresponding rapi command.
> > +
> > +
> > +Cluster verify
> > +~~~~~~~~~~~~~~
> > +
> > +To make sure the private root ssh key was not distributed to a normal
> > +node, 'gnt-cluster verify' will be extended by a check for the key
> > +on normal nodes. Additionally, it will check if the private key is
> > +indeed present on master candidates.
> > +
> > +
> > +
> > +Proposal regarding node daemon certificates
> > +-------------------------------------------
> > +
> > +Regarding the node daemon certificates, we propose the following changes
> > +in the design.
> > +
> > +- Instead of using the same certificate for all nodes as both server
> > +  and client certificate, we generate a common server certificate (and
> > +  the corresponding private key) for all nodes and a different client
> > +  certificate (and the corresponding private key) for each node.
> > +- In addition, we store a mapping of
> > +  (node UUID, client certificate digest) in the cluster's configuration
> > +  and ssconf for hosts that are master or master candidate.
> > +  The client certificate digest is a hash of the client certificate.
> > +  We suggest a 'sha1' hash here. We will call this mapping 'candidate
> > +  map' from here on.
> > +- The node daemon will be modified in a way that on an incoming RPC
> > +  request, it first performs a client verification (same as before) to
> > +  ensure that the requesting host is indeed the holder of the
> > +  corresponding private key. Additionally, it compares the digest of
> > +  the certificate of the incoming request to the respective entry of
> > +  the candidate map. If the digest does not match the entry of the host
> > +  in the mapping or is not included in the mapping at all, the SSL
> > +  connection is refused.
> > +
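To make the proposed check concrete, here is a minimal Python sketch. The helper name and the candidate-map layout (node UUID mapped to the sha1 hex digest of the client certificate) are illustrative assumptions, not actual Ganeti code:

```python
import hashlib

def check_candidate_cert(der_cert, node_uuid, candidate_map):
    """Sketch of the proposed noded-side check: after the usual SSL
    client verification succeeds, accept the RPC connection only if
    the digest of the presented client certificate matches the node's
    entry in the candidate map.

    candidate_map maps node UUIDs to sha1 hex digests of the client
    certificates, as stored in the configuration and ssconf.
    """
    digest = hashlib.sha1(der_cert).hexdigest()
    expected = candidate_map.get(node_uuid)
    # Refuse if the node is not in the map or its digest differs.
    return expected is not None and expected == digest
```

Note that this check comes on top of the SSL client verification, which already proves possession of the corresponding private key.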
> > +This design has the following advantages:
> > +
> > +- A compromised normal node cannot issue RPC calls, because it will
> > +  not be in the candidate map.
> > +- A compromised master candidate would be able to issue RPC requests,
> > +  but on detection of its compromised state, it can be removed from
> > +  the cluster (and thus from the candidate map) without the need for
> > +  redistribution of any certificates, because the other master
> > +  candidates can continue using their own certificates.
> > +- A compromised node would not be able to use the other (possibly master
> > +  candidate) nodes' information from the candidate map to issue RPCs,
> > +  because the config just stores the digests and not the certificate
> > +  itself.
> > +- A compromised node would be able to obtain another node's certificate
> > +  by waiting for incoming RPCs from this other node. However, the node
> > +  cannot use the certificate to issue RPC calls, because the SSL client
> > +  verification would require the node to hold the corresponding private
> > +  key as well.
> > +
> > +Drawbacks of this design:
> > +
> > +- Complexity of node and certificate management will be increased (see
> > +  following sections for details).
> > +- If the candidate map is not distributed fast enough to all nodes
> > +  after an update of the configuration, it might be possible to issue
> > +  RPC calls from a compromised master candidate node that has been
> > +  removed from the Ganeti cluster already. However, this is still a
> > +  better situation than before, and an inherent problem when one wants
> > +  to distinguish between master candidates and normal nodes.
> > +
> > +Alternative proposals:
> > +
> > +- Instead of generating a client certificate per node, one could think
> > +  of just generating two different client certificates, one for normal
> > +  nodes and one for master candidates. Noded could then just check if
> > +  the requesting node has the master candidate certificate. Drawback of
> > +  this proposal is that once one master candidate gets compromised, all
> > +  master candidates would need to get a new certificate.
>
> This is true anyway, since it would be trivial from a compromised
> master candidate to fetch all other MCs certificates, e.g. via ssh.
>

Sure, I rephrased that a little to point out that the improvement helps
only as long as the compromised node hasn't fetched anything yet:

[...] Drawback of
  this proposal is that once one master candidate gets compromised, all
  master candidates would need to get a new certificate even if the
  compromised master candidate had not yet fetched the certificates
  from the other master candidates via ssh.




>
> > +- In addition to our main proposal, one could think of including a
> > +  piece of data (for example the node's host name or UUID) in the RPC
> > +  call which is encrypted with the requesting node's private key. The
> > +  node daemon could check if the datum can be decrypted using the node's
> > +  certificate. However, this would provide functionality similar to
> > +  SSL's built-in client verification and add significant complexity
> > +  to Ganeti's RPC protocol.
> > +
> > +In the following sections, we describe how our design affects various
> > +Ganeti operations.
> > +
> > +
> > +Cluster initialization
> > +~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +On cluster initialization, so far only the node daemon certificate was
> > +created. With our design, two certificates (and corresponding keys)
> > +need to be created, a server certificate to be distributed to all nodes
> > +and a client certificate only to be used by this particular node. In the
> > +following, we use the term node daemon certificate for the server
> > +certificate only.
> > +
> > +In the cluster configuration, the candidate map is created. It is
> > +populated with the respective entry for the master node. It is also
> > +written to ssconf.
> > +
> > +
> > +(Re-)Adding nodes
> > +~~~~~~~~~~~~~~~~~
> > +
> > +When a node is added, the server certificate is copied to the node (as
> > +before). Additionally, a new client certificate (and the corresponding
> > +private key) is created on the new node to be used only by the new node
> > +as client certificate.
> > +
> > +If the new node is a master candidate, the candidate map is extended by
> > +the new node's data. As before, the updated configuration is distributed
> > +to all nodes (as complete configuration on the master candidates and
> > +ssconf on all nodes). Note that distribution of the configuration after
> > +adding a node is already implemented, since all nodes hold the list of
> > +nodes in the cluster in ssconf anyway.
> > +
> > +If the configuration for whatever reason already holds an entry for this
> > +node, it will be overridden.
> > +
> > +When readding a node, the procedure is the same as for adding a node.
>
> > +
> > +Promotion and demotion of master candidates
> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +When a normal node gets promoted to be master candidate, an entry to the
> > +candidate map has to be added and the updated configuration has to be
> > +distributed to all nodes. If there was already an entry for the node,
> > +we override it.
> > +
> > +On demotion of a master candidate, the node's entry in the candidate map
> > +gets removed and the updated configuration gets redistributed.
> > +
> > +The same procedure applies to onlining and offlining master candidates.
> > +
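The map updates on promotion and demotion could be sketched as follows (hypothetical helper names for illustration, not actual Ganeti code):

```python
def promote_to_candidate(candidate_map, node_uuid, cert_digest):
    # Add the node's digest, overriding any stale entry that may
    # already exist for this node.
    new_map = dict(candidate_map)
    new_map[node_uuid] = cert_digest
    return new_map

def demote_from_candidate(candidate_map, node_uuid):
    # Remove the node's entry; tolerate an already-missing entry so
    # the operation is idempotent.
    new_map = dict(candidate_map)
    new_map.pop(node_uuid, None)
    return new_map
```

In both cases the updated map would then be written to the configuration and redistributed via ssconf.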
> > +
> > +Cluster verify
> > +~~~~~~~~~~~~~~
> > +
> > +Cluster verify will be extended by the following checks:
> > +
> > +- Whether each entry in the candidate map indeed corresponds to a master
> > +  candidate.
> > +- Whether each master candidate's certificate digest matches its
> > +  entry in the candidate map.
> > +
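The two checks could be sketched like this (the data layout — a map from node UUID to an (is_master_candidate, digest) pair — is an assumption for illustration, not Ganeti's actual configuration schema):

```python
def verify_candidate_map(candidate_map, nodes):
    """Return a list of error messages for the two checks above.

    nodes maps each node UUID to a pair of (is_master_candidate,
    sha1 hex digest of that node's client certificate).
    """
    errors = []
    for uuid, digest in candidate_map.items():
        if uuid not in nodes or not nodes[uuid][0]:
            # Entry does not correspond to a master candidate.
            errors.append("stale candidate map entry: %s" % uuid)
        elif nodes[uuid][1] != digest:
            # Digest on the node differs from the map entry.
            errors.append("certificate digest mismatch: %s" % uuid)
    return errors
```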
> > +
> > +Crypto renewal
> > +~~~~~~~~~~~~~~
> > +
> > +Currently, when the cluster's cryptographic tokens are renewed using
> > +the ``gnt-cluster renew-crypto`` command, the node daemon certificate
> > +is renewed (among others). Option ``--new-cluster-certificate`` renews the
> > +node daemon certificate only.
> > +
> > +In addition to the renewal of the node daemon server certificate, we
> > +propose to renew all client certificates when ``gnt-cluster
> > +renew-crypto`` is called without any other option.
> > +
> > +By adding an option ``--new-node-certificates``, we offer to renew
> > +the client certificates only. Whenever the client certificates are
> > +renewed, the candidate map has to be updated and redistributed.
> > +
> > +If for whatever reason there is an entry in the candidate map of a node
> > +that is not a master candidate (for example due to inconsistent
> > +updating after a demotion or offlining), we give the user the option
> > +to remove the entry from the candidate map (for example if cluster
> > +verify detects this inconsistency). We propose to implement a new
> > +option called
> > +
> > +::
> > +
> > +  gnt-cluster renew-crypto --update-candidate-map
> > +
> > +TODO: describe what exactly should happen here
> > +
> > +
> > +Further considerations
> > +----------------------
> > +
> > +Watcher
> > +~~~~~~~
> > +
> > +The watcher is a script that is run on all nodes at regular
> > +intervals. The changes proposed in this design will not affect the
> > +watcher's implementation, because it behaves differently on the
> > +master than on non-master nodes.
> > +
> > +Only on the master does it issue query calls, which would require a
> > +client certificate of a node in the candidate map; this is the case
> > +for the master node. On non-master nodes, its only external
> > +communication is done via the ConfD protocol, which uses the hmac
> > +key, which is present on all nodes.
> > +Besides that, the watcher does not make any ssh connections, and thus is
> > +not affected by the changes in ssh key handling either.
> > +
> > +
> > +Other Keys
> > +~~~~~~~~~~
> > +
> > +Ganeti handles a couple of other keys/certificates that have not
> > +been mentioned in this design so far. They will not be affected by
> > +this design for several reasons:
> > +
> > +- The hmac key used by ConfD (see ``design-2.1.rst``): the hmac key
> > +  is still distributed to all nodes, because it was designed to be
> > +  used for communicating with ConfD, which should be possible from
> > +  all nodes. For example, the monitoring daemon, which runs on all
> > +  nodes, uses it to retrieve information from ConfD. However, since
> > +  communication with ConfD is read-only, a compromised node holding
> > +  the hmac key does not enable an attacker to change the cluster's
> > +  state.
> > +
> > +  (TODO: what about WConfD?)
> > +
> > +- The rapi SSL key/certificate and the rapi user/password file
> > +  'rapi_users' are already only copied to the master candidates (see
> > +  ``design-2.1.rst``, Section ``Redistribute Config``).
> > +
> > +- The spice certificates are still distributed to all nodes, since
> > +  it should be possible to use spice to access VMs on any cluster
> > +  node.
> > +
> > +- The cluster domain secret is used for inter-cluster instance
> > +  moves. Since instances can be moved from any normal node of the
> > +  source cluster to any normal node of the destination cluster, the
> > +  presence of this secret on all nodes is necessary.
> > +
> > +
> > +Related and Future Work
> > +~~~~~~~~~~~~~~~~~~~~~~~
> > +
> > +Ganeti RPC calls are currently done without server verification.
> > +Establishing server verification might be a desirable feature, but is
> > +not part of this design.
> > +
> > +.. vim: set textwidth=72 :
> > +.. Local Variables:
> > +.. mode: rst
> > +.. fill-column: 72
> > +.. End:
> > diff --git a/doc/index.rst b/doc/index.rst
> > index 7ec8162..e1dad68 100644
> > --- a/doc/index.rst
> > +++ b/doc/index.rst
> > @@ -110,6 +110,7 @@ Draft designs
> >     cluster-merge.rst
> >     design-autorepair.rst
> >     design-bulk-create.rst
> > +   design-candidates.rst
> >     design-chained-jobs.rst
> >     design-cmdlib-unittests.rst
> >     design-cpu-pinning.rst
> > --
> > 1.8.4.1
> >
>
> LGTM (with the interdiffs). Note that if it would be simpler to just
> have two certs instead of one client cert per node I wouldn't mind
> that approach either.
>

Thanks, I'll leave the decision open for now and will decide once I have
a better overview of the code and can better estimate the effort it
requires.

Cheers,
Helga
