FYI, I also renamed the file to 'design-node-security' to match the title
better:

diff --git a/doc/index.rst b/doc/index.rst
index e1dad68..73b2eae 100644
--- a/doc/index.rst
+++ b/doc/index.rst
@@ -110,7 +110,6 @@ Draft designs
    cluster-merge.rst
    design-autorepair.rst
    design-bulk-create.rst
-   design-candidates.rst
    design-chained-jobs.rst
    design-cmdlib-unittests.rst
    design-cpu-pinning.rst
@@ -124,6 +123,7 @@ Draft designs
    design-multi-version-tests.rst
    design-network.rst
    design-node-add.rst
+   design-node-security.rst
    design-oob.rst
    design-openvswitch.rst
    design-opportunistic-locking.rst



On Thu, Dec 5, 2013 at 10:19 AM, Helga Velroyen <[email protected]> wrote:

> (adding back the list)
>
>
> On Thu, Dec 5, 2013 at 10:02 AM, Guido Trotter <[email protected]>wrote:
>
>> On Wed, Dec 4, 2013 at 2:56 PM, Helga Velroyen <[email protected]> wrote:
>> > This is a design doc addressing issue 377. Objective is
>> > to reduce the number of nodes that are able to establish
>> > ssh and RPC connections to other nodes. Limiting this
>> > set of nodes to the master candidates is desired to
>> > decrease the risk of a compromised node compromising the
>> > entire cluster.
>> >
>> > Signed-off-by: Helga Velroyen <[email protected]>
>> > ---
>> >  doc/design-candidates.rst | 379
>> ++++++++++++++++++++++++++++++++++++++++++++++
>> >  doc/index.rst             |   1 +
>> >  2 files changed, 380 insertions(+)
>> >  create mode 100644 doc/design-candidates.rst
>> >
>> > diff --git a/doc/design-candidates.rst b/doc/design-candidates.rst
>> > new file mode 100644
>> > index 0000000..1da5b2e
>> > --- /dev/null
>> > +++ b/doc/design-candidates.rst
>> > @@ -0,0 +1,379 @@
>> > +================================================
>> > +Improvements regarding Master Candidate Security
>> > +================================================
>> > +
>> > +This document describes an enhancement of Ganeti's security by
>> restricting
>> > +the distribution of security-sensitive data to the master and master
>> > +candidates only.
>> > +
>> > +Note: In this document, we will use the term 'normal node' for a node
>> that
>> > +is neither master nor master-candidate.
>> > +
>> > +.. contents:: :depth: 4
>> > +
>> > +Objective
>> > +=========
>> > +
>> > +Up till 2.10, Ganeti distributed security-relevant keys to all nodes,
>> > +including nodes that are neither master nor master-candidates. Those
>> > +keys are the private and public SSH keys for node communication and the
>> > +SSL certificate and private key for RPC communication. The objective of
>> this
>> > +design is to limit the set of nodes that can establish ssh and RPC
>> > +connections to the master and master candidates.
>> > +
>> > +As pointed out in
>> > +`issue 377 <https://code.google.com/p/ganeti/issues/detail?id=377>`_,
>> this
>> > +is a security risk. Since all nodes have these keys, compromising
>> > +any of those nodes would possibly give an attacker access to all other
>> > +machines in the cluster. Reducing the set of nodes that are able to
>> > +make ssh and RPC connections to the master and master candidates would
>> > +significantly reduce the risk simply because fewer machines would be a
>> > +valuable target for attackers.
>> > +
>>
>> I would add here that bigger installations could choose to run master
>> candidates only on non-vm-capable nodes, thus removing the hypervisor
>> attack surface.
>>
>
> True, I added this paragraph:
>
> +Note: For bigger installations of Ganeti, it is advisable to run master
> +candidate nodes as non-vm-capable nodes. This reduces the attack
> +surface for hypervisor exploitation.
> +
>
>
>>
>> > +
>> > +Detailed design
>> > +===============
>> > +
>> > +
>> > +Current state and shortcomings
>> > +------------------------------
>> > +
>> > +Currently (as of 2.10), all nodes hold the following information:
>> > +
>> > +- the ssh host keys (public and private)
>> > +- the ssh root keys (public and private)
>> > +- node daemon certificates (the SSL client certificate and its
>> > +  corresponding private key)
>> > +
>> > +Concerning ssh, this setup contains the following security issue. Since
>> > +all nodes of a cluster can ssh as root into any other cluster node, one
>> > +compromised node can harm all other nodes of a cluster.
>> > +
>> > +Regarding the SSL encryption of the RPC communication with the node
>> > +daemon, we currently have the following setup. There is only one
>> > +certificate which is used as both client and server certificate.
>> Besides
>> > +the SSL client verification, we check if the used client certificate is
>> > +the same as the certificate stored on the server.
>> > +
>> > +This means that any node running a node daemon can also act as an RPC
>> > +client and use it to issue RPC calls to other cluster nodes. This in
>> > +turn means that any compromised node could be used to make RPC calls to
>> > +any node (including itself) to gain full control over VMs. This could
>> > +be used by an attacker, for example, to bring down the VMs or exploit
>> bugs
>> > +in the virtualization stacks to gain access to the host machines as
>> well.
>> > +
>> > +
>> > +Proposal concerning SSH key distribution
>> > +----------------------------------------
>> > +
>> > +We propose to limit the set of nodes holding the private root user SSH
>> key
>> > +to the master and the master candidates. This way, the security risk
>> would
>> > +be limited to a rather small set of nodes even though the cluster could
>> > +consist of many more nodes. The set of master candidates could be
>> protected
>> > +better than the normal nodes (for example residing in a DMZ) to enhance
>> > +security even more if the administrator so wishes. The following
>> > +sections describe in detail which Ganeti commands are affected by this
>> > +change and in what way.
>> > +
>> > +
>> > +(Re-)Adding nodes to a cluster
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +According to ``design-node-add.rst``, Ganeti transfers the ssh keys to
>> every
>> > +node that gets added to the cluster.
>> > +
>> > +We propose to change this procedure to treat master candidates and
>> normal
>> > +nodes differently. For master candidates, the procedure would stay as
>> is.
>> > +For normal nodes, Ganeti would transfer the public and private ssh host
>> > +keys (as before) and only the public root key.
>> > +
>> > +A normal node would not be able to connect via ssh to other nodes, but
>> > +the master (and potentially master candidates) can connect to this
>> node.
>> > +
>> > +In case of readding a node that used to be in the cluster before,
>> > +handling of the ssh keys would basically be the same with the following
>> > +additional modifications: if the node used to be a master or
>> > +master-candidate node, but will be a normal node after readding, Ganeti
>> > +should make sure that the private root key is deleted if it is still
>> > +present on the node.
>> > +
>> > +
>> > +Pro- and demoting a node to/from master candidate
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +If the role of a node is changed from 'normal' to 'master_candidate',
>> the
>> > +master node should at that point copy the private root ssh key to it. When
>> demoting
>> > +a node from master candidate to a normal node, the key that has been
>> copied
>> > +there on promotion or addition should be removed again.
>> > +
>> > +This affects the behavior of the following commands:
>> > +
>> > +::
>> > +
>> > +  gnt-node modify --master-candidate=yes
>> > +  gnt-node modify --master-candidate=no [--auto-promote]
>> > +
>> > +If the node was already a master candidate before the promotion
>> > +command was issued, Ganeti does nothing.
>> > +
>> > +Note that when you demote a node from master candidate to normal node,
>> another
>> > +master-capable and normal node will be promoted to master candidate.
>> For this
>> > +newly promoted node, the same changes apply as if it was explicitly
>> promoted.
>> > +
>> > +The same behavior should be ensured for the corresponding rapi command.
>> > +
>> > +
>> > +Offlining and onlining a node
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +When offlining a node, it immediately loses its role as master or
>> master
>> > +candidate as well. When it is onlined again, it will become master
>> > +candidate again if it was so before. The handling of the keys should
>> be done
>> > +in the same way as when the node is explicitly promoted or demoted to
>> or from
>> > +master candidate. See the previous section for details.
>> > +
>> > +This affects the commands:
>> > +
>> > +::
>> > +
>> > +  gnt-node modify --offline=yes
>> > +  gnt-node modify --offline=no [--auto-promote]
>> > +
>> > +For offlining, the removal of the keys is particularly important, as
>> the
>> > +detection of a compromised node might be the very reason for the
>> offlining.
>> > +
>> > +The same behavior should be ensured for the corresponding rapi command.
>> > +
>> > +
>> > +Cluster verify
>> > +~~~~~~~~~~~~~~
>> > +
>> > +To make sure the private root ssh key was not distributed to a normal
>> > +node, 'gnt-cluster verify' will be extended with a check that the key
>> > +is absent on normal nodes. Additionally, it will check that the private
>> > +key is indeed present on master candidates.
>> > +
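The proposed verify check could be sketched roughly as follows; the function and parameter names are illustrative assumptions, not Ganeti's actual verification code:

```python
# Hypothetical sketch of the proposed 'gnt-cluster verify' SSH key
# check: the private root SSH key must be present on master candidates
# and absent on normal nodes. Names are illustrative, not Ganeti's API.
def check_node_ssh_keys(is_master_candidate, has_private_root_key):
    errors = []
    if is_master_candidate and not has_private_root_key:
        errors.append("private root SSH key missing on master candidate")
    if not is_master_candidate and has_private_root_key:
        errors.append("private root SSH key present on normal node")
    return errors
```

Any errors returned here would be reported like other cluster-verify findings.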
>> > +
>> > +
>> > +Proposal regarding node daemon certificates
>> > +-------------------------------------------
>> > +
>> > +Regarding the node daemon certificates, we propose the following
>> changes
>> > +in the design.
>> > +
>> > +- Instead of using the same certificate for all nodes as both server
>> > +  and client certificate, we generate a common server certificate (and
>> > +  the corresponding private key) for all nodes and a different client
>> > +  certificate (and the corresponding private key) for each node.
>> > +- In addition, we store a mapping of
>> > +  (node UUID, client certificate digest) in the cluster's configuration
>> > +  and ssconf for hosts that are master or master candidate.
>> > +  The client certificate digest is a hash of the client certificate.
>> > +  We suggest a 'sha1' hash here. We will call this mapping 'candidate
>> map'
>> > +  from here on.
>> > +- The node daemon will be modified in a way that on an incoming RPC
>> > +  request, it first performs a client verification (same as before) to
>> > +  ensure that the requesting host is indeed the holder of the
>> > +  corresponding private key. Additionally, it compares the digest of
>> > +  the certificate of the incoming request to the respective entry of
>> > +  the candidate map. If the digest does not match the entry of the host
>> > +  in the mapping or is not included in the mapping at all, the SSL
>> > +  connection is refused.
>> > +
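The digest comparison in the last step might look like the following sketch, assuming the requesting node's UUID and the candidate map are available to the node daemon (function and parameter names are illustrative assumptions):

```python
import hashlib

# Hypothetical sketch of the proposed noded-side check: after the
# usual SSL client verification, compare the sha1 digest of the
# presented client certificate with the node's candidate map entry.
def verify_client_cert(der_cert, node_uuid, candidate_map):
    digest = hashlib.sha1(der_cert).hexdigest()
    expected = candidate_map.get(node_uuid)
    if expected is None:
        return False  # requesting node is not in the candidate map
    return digest == expected  # refuse the connection on mismatch
```

A compromised normal node fails this check even with a valid certificate, because it has no entry in the map.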
>> > +This design has the following advantages:
>> > +
>> > +- A compromised normal node cannot issue RPC calls, because it will
>> > +  not be in the candidate map.
>> > +- A compromised master candidate would be able to issue RPC requests,
>> > +  but on detection of its compromised state, it can be removed from the
>> > +  cluster (and thus from the candidate map) without the need for
>> > +  redistribution of any certificates, because the other master
>> candidates
>> > +  can continue using their own certificates.
>> > +- A compromised node would not be able to use the other (possibly
>> master
>> > +  candidate) nodes' information from the candidate map to issue RPCs,
>> > +  because the config just stores the digests and not the certificate
>> > +  itself.
>> > +- A compromised node would be able to obtain another node's certificate
>> > +  by waiting for incoming RPCs from this other node. However, the node
>> > +  cannot use the certificate to issue RPC calls, because the SSL client
>> > +  verification would require the node to hold the corresponding private
>> > +  key as well.
>> > +
>> > +Drawbacks of this design:
>> > +
>> > +- Complexity of node and certificate management will be increased (see
>> > +  following sections for details).
>> > +- If the candidate map is not distributed fast enough to all nodes
>> after
>> > +  an update of the configuration, it might be possible to issue RPC
>> calls
>> > +  from a compromised master candidate node that has been removed
>> > +  from the Ganeti cluster already. However, this is still a better
>> > +  situation than before, and the race is an inherent problem when one
>> > +  wants to distinguish between master candidates and normal nodes.
>> > +
>> > +Alternative proposals:
>> > +
>> > +- Instead of generating a client certificate per node, one could think
>> > +  of just generating two different client certificates, one for normal
>> > +  nodes and one for master candidates. Noded could then just check if
>> > +  the requesting node has the master candidate certificate. Drawback of
>> > +  this proposal is that once one master candidate gets compromised, all
>> > +  master candidates would need to get a new certificate.
>>
>> This is true anyway, since it would be trivial from a compromised
>> master candidate to fetch all other MCs certificates, e.g. via ssh.
>>
>
> Sure, I rephrased that a little to point out that the improvement helps
> only as long as the compromised node hasn't fetched anything yet:
>
> [...] Drawback of
>   this proposal is that once one master candidate gets compromised, all
>   master candidates would need to get a new certificate even if the
>   compromised master candidate had not yet fetched the certificates
>   from the other master candidates via ssh.
>
>
>
>
>>
>> > +- In addition to our main proposal, one could think of including a
>> > +  piece of data (for example the node's host name or UUID) in the RPC
>> > +  call which is encrypted with the requesting node's private key. The
>> > +  node daemon could check if the datum can be decrypted using the
>> node's
>> > +  certificate. However, this would provide functionality similar to
>> > +  SSL's built-in client verification and add significant complexity
>> > +  to Ganeti's RPC protocol.
>> > +
>> > +In the following sections, we describe how our design affects various
>> > +Ganeti operations.
>> > +
>> > +
>> > +Cluster initialization
>> > +~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +On cluster initialization, so far only the node daemon certificate was
>> > +created. With our design, two certificates (and corresponding keys)
>> > +need to be created, a server certificate to be distributed to all nodes
>> > +and a client certificate only to be used by this particular node. In
>> the
>> > +following, we use the term node daemon certificate for the server
>> > +certificate only.
>> > +
>> > +In the cluster configuration, the candidate map is created. It is
>> > +populated with the respective entry for the master node. It is also
>> > +written to ssconf.
>> > +
>> > +
>> > +(Re-)Adding nodes
>> > +~~~~~~~~~~~~~~~~~
>> > +
>> > +When a node is added, the server certificate is copied to the node (as
>> > +before). Additionally, a new client certificate (and the corresponding
>> > +private key) is created on the new node to be used only by the new node
>> > +as client certificate.
>> > +
>> > +If the new node is a master candidate, the candidate map is extended by
>> > +the new node's data. As before, the updated configuration is
>> distributed
>> > +to all nodes (as complete configuration on the master candidates and
>> > +ssconf on all nodes). Note that distribution of the configuration after
>> > +adding a node is already implemented, since all nodes hold the list of
>> > +nodes in the cluster in ssconf anyway.
>> > +
>> > +If the configuration for whatever reason already holds an entry for
>> this
>> > +node, it will be overridden.
>> > +
>> > +When readding a node, the procedure is the same as for adding a node.
>>
>> > +
>> > +Promotion and demotion of master candidates
>> > +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +When a normal node gets promoted to be master candidate, an entry to
>> the
>> > +candidate map has to be added and the updated configuration has to be
>> > +distributed to all nodes. If there was already an entry for the node,
>> > +we override it.
>> > +
>> > +On demotion of a master candidate, the node's entry in the candidate
>> map
>> > +gets removed and the updated configuration gets redistributed.
>> > +
>> > +The same procedure applies to onlining and offlining master candidates.
>> > +
>> > +
>> > +Cluster verify
>> > +~~~~~~~~~~~~~~
>> > +
>> > +Cluster verify will be extended by the following checks:
>> > +
>> > +- Whether each entry in the candidate map indeed corresponds to a
>> master
>> > +  candidate.
>> > +- Whether each master candidate's certificate digest matches its entry
>> > +  in the candidate map.
>> > +
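These two checks might be sketched as follows; the data shapes and names are assumptions for illustration, not actual cluster-verify code:

```python
# Hypothetical sketch of the extended cluster-verify checks: every
# candidate map entry must belong to a current master candidate, and
# each candidate's actual certificate digest must match its entry.
def verify_candidate_map(candidate_map, master_candidates, node_digests):
    errors = []
    for uuid in candidate_map:
        if uuid not in master_candidates:
            errors.append("stale candidate map entry: %s" % uuid)
    for uuid in master_candidates:
        if node_digests.get(uuid) != candidate_map.get(uuid):
            errors.append("digest mismatch for %s" % uuid)
    return errors
```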
>> > +
>> > +Crypto renewal
>> > +~~~~~~~~~~~~~~
>> > +
>> > +Currently, when the cluster's cryptographic tokens are renewed using
>> the
>> > +``gnt-cluster renew-crypto`` command, the node daemon certificate is
>> > +renewed (among others). Option ``--new-cluster-certificate`` renews the
>> > +node daemon certificate only.
>> > +
>> > +In addition to the renewal of the node daemon server certificate, we
>> > +propose to renew all client certificates when ``gnt-cluster
>> > +renew-crypto`` is called without another option.
>> > +
>> > +By adding an option ``--new-node-certificates`` we offer to renew the
>> > +client certificates only. Whenever the client certificates are
>> renewed, the
>> > +candidate map has to be updated and redistributed.
>> > +
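The renewal flow could be sketched as below; ``generate_cert`` stands in for whatever certificate generation Ganeti would actually use, and all names are illustrative assumptions:

```python
import hashlib

# Hypothetical sketch of --new-node-certificates: regenerate each
# node's client certificate, then rebuild the candidate map from the
# new digests before redistributing it. Names are illustrative only.
def renew_client_certs(nodes, master_candidates, generate_cert):
    candidate_map = {}
    for node in nodes:
        cert = generate_cert(node)  # new client certificate per node
        if node in master_candidates:
            candidate_map[node] = hashlib.sha1(cert).hexdigest()
    return candidate_map
```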
>> > +If for whatever reason there is an entry in the candidate map of a node
>> > +that is not a master candidate (for example due to inconsistent updating
>> > +after a demotion or offlining), we give the user the option to remove the
>> > +entry from the candidate map (for example if cluster verify detects this
>> > +inconsistency). We propose to implement a new option called
>> > +
>> > +::
>> > +
>> > +  gnt-cluster renew-crypto --update-candidate-map
>> > +
>> > +TODO: describe what exactly should happen here
>> > +
>> > +
>> > +Further considerations
>> > +----------------------
>> > +
>> > +Watcher
>> > +~~~~~~~
>> > +
>> > +The watcher is a script that is run on all nodes at regular intervals.
>> The
>> > +changes proposed in this design will not affect the watcher's
>> implementation,
>> > +because it behaves differently on the master than on non-master nodes.
>> > +
>> > +Only on the master does it issue query calls, which require a client
>> > +certificate with an entry in the candidate map; this holds for
>> the
>> > +master node. On non-master nodes, its only external communication is
>> done via
>> > +the ConfD protocol, which uses the hmac key, which is present on all
>> nodes.
>> > +Besides that, the watcher does not make any ssh connections, and thus
>> is
>> > +not affected by the changes in ssh key handling either.
>> > +
>> > +
>> > +Other Keys
>> > +~~~~~~~~~~
>> > +
>> > +Ganeti handles a couple of other keys/certificates that have not been
>> mentioned
>> > +in this design so far. They will not be affected by this design for
>> several
>> > +reasons:
>> > +
>> > +- The hmac key used by ConfD (see ``design-2.1.rst``): the hmac key is
>> still
>> > +  distributed to all nodes, because it was designed to be used for
>> > +  communicating with ConfD, which should be possible from all nodes.
>> > +  For example, the monitoring daemon which runs on all nodes uses it to
>> > +  retrieve information from ConfD. However, since communication with
>> ConfD
>> > +  is read-only, a compromised node holding the hmac key does not
>> enable an
>> > +  attacker to change the cluster's state.
>> > +
>> > +  (TODO: what about WConfD?)
>> > +
>> > +- The rapi SSL certificate and key and the rapi user/password file
>> 'rapi_users' are
>> > +  already copied only to the master candidates (see ``design-2.1.rst``,
>> > +  Section ``Redistribute Config``).
>> > +
>> > +- The spice certificates are still distributed to all nodes, since it
>> should
>> > +  be possible to use spice to access VMs on any cluster node.
>> > +
>> > +- The cluster domain secret is used for inter-cluster instance moves.
>> > +  Since instances can be moved from any normal node of the source
>> cluster to
>> > +  any normal node of the destination cluster, the presence of this
>> > +  secret on all nodes is necessary.
>> > +
>> > +
>> > +Related and Future Work
>> > +~~~~~~~~~~~~~~~~~~~~~~~
>> > +
>> > +Ganeti RPC calls are currently done without server verification.
>> > +Establishing server verification might be a desirable feature, but is
>> > +not part of this design.
>> > +
>> > +.. vim: set textwidth=72 :
>> > +.. Local Variables:
>> > +.. mode: rst
>> > +.. fill-column: 72
>> > +.. End:
>> > diff --git a/doc/index.rst b/doc/index.rst
>> > index 7ec8162..e1dad68 100644
>> > --- a/doc/index.rst
>> > +++ b/doc/index.rst
>> > @@ -110,6 +110,7 @@ Draft designs
>> >     cluster-merge.rst
>> >     design-autorepair.rst
>> >     design-bulk-create.rst
>> > +   design-candidates.rst
>> >     design-chained-jobs.rst
>> >     design-cmdlib-unittests.rst
>> >     design-cpu-pinning.rst
>> > --
>> > 1.8.4.1
>> >
>>
>> LGTM (with the interdiffs). Note that if it would be simpler to just
>> have two certs instead of one client cert per node I wouldn't mind
>> that approach either.
>>
>
> Thanks, I'll leave the decision open for now and will decide once I
> have a better overview of the code and can better estimate the
> effort it requires.
>
> Cheers,
> Helga
>



-- 
Helga Velroyen | Software Engineer | [email protected] |

Google Germany GmbH
Dienerstr. 12
80331 München

Registergericht und -nummer: Hamburg, HRB 86891
Sitz der Gesellschaft: Hamburg
Geschäftsführer: Graham Law, Christine Elizabeth Flores
