[
https://issues.apache.org/jira/browse/AURORA-1790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jake Farrell reassigned AURORA-1790:
------------------------------------
Assignee: Jake Farrell
> Aurora CNI Support
> ------------------
>
> Key: AURORA-1790
> URL: https://issues.apache.org/jira/browse/AURORA-1790
> Project: Aurora
> Issue Type: Epic
> Reporter: Stephan Erb
> Assignee: Jake Farrell
>
> The [Container Network Interface
> (CNI)|https://github.com/containernetworking/cni/blob/master/SPEC.md] is a
> plug-in based networking solution for containers. CNI is [supported by the
> Mesos Unified
> Containerizer|https://github.com/apache/mesos/blob/master/docs/cni.md].
> CNI support in Aurora would enable cluster operators to isolate tasks at the
> network level. This includes features such as IP-per-container, or security
> policies ensuring that only designated subsets of containers can communicate
> with each other. Both are important features for multi-tenant environments.
> h2. Mesos Protobufs
> In order to launch a task using CNI, Mesos requires frameworks to populate
> the
> [NetworkInfo|https://github.com/apache/mesos/blob/0f97117bac3e1382744e9a847ce11b7589fc45bd/include/mesos/mesos.proto#L1916-L1999]
> protobuf. The following shows the relevant subset of fields:
> {code}
> /**
>  * Describes a container configuration and allows extensible
>  * configurations for different container implementations.
>  *
>  * NOTE: In the Aurora case, this is set as part of ExecutorInfo
>  */
> message ContainerInfo {
>   ...
>   // A list of network requests. A framework can request multiple IP addresses
>   // for the container.
>   repeated NetworkInfo network_infos = 7;
>   ...
> }
>
> /**
>  * Describes a network request from a framework as well as network resolution
>  * provided by Mesos.
>  */
> message NetworkInfo {
>   ...
>   // For the CNI case, empty during task/executor launch and only used
>   // in TaskStatus messages to inform the framework scheduler about
>   // the IP addresses bound to a container
>   repeated IPAddress ip_addresses = 5;
>
>   // Name of the network which will be used by network isolator to determine
>   // the network that the container joins. It's up to the network isolator
>   // to decide how to interpret this field.
>   optional string name = 6;
>
>   // To tag certain metadata to be used by Isolator/IPAM, e.g., rack, etc.
>   // Opaque to Mesos but interpreted by the CNI plugin
>   optional Labels labels = 4;
>   ...
> }
>
> /**
>  * Container related information that is resolved during container
>  * setup. The information is sent back to the framework as part of the
>  * TaskStatus message.
>  */
> message ContainerStatus {
>   // This field can be reliably used to identify the container IP address.
>   repeated NetworkInfo network_infos = 1;
>   ...
> }
> {code}
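To make the request side of the messages above concrete, here is a minimal Python sketch of how a framework might fill in one NetworkInfo per network it wants a container to join. The dataclasses are plain stand-ins for the protobuf messages, not the generated Mesos bindings, and `build_container_info` plus the network names and labels are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NetworkInfo:
    """Stand-in for the Mesos NetworkInfo message (subset of fields)."""
    name: str = ""
    labels: Dict[str, str] = field(default_factory=dict)
    # For CNI this stays empty at launch; Mesos fills it in TaskStatus.
    ip_addresses: List[str] = field(default_factory=list)


@dataclass
class ContainerInfo:
    """Stand-in for the Mesos ContainerInfo message (subset of fields)."""
    network_infos: List[NetworkInfo] = field(default_factory=list)


def build_container_info(networks: Dict[str, Dict[str, str]]) -> ContainerInfo:
    """Hypothetical helper: one NetworkInfo request per CNI network name."""
    infos = [
        NetworkInfo(name=name, labels=dict(labels))
        for name, labels in networks.items()
    ]
    return ContainerInfo(network_infos=infos)


# Example: join two (made-up) networks, tagging the first with a rack label.
container = build_container_info({"tenant-a": {"rack": "r1"}, "monitoring": {}})
```

Note that the request carries only names and labels; no IP address is known to the framework at this point.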
> h2. Challenges
> * In contrast to ports or other resources, this is the first time an
> important detail is only discovered asynchronously after a task has been
> launched, i.e. the scheduler will only learn about the IP addresses of the
> launched task after having received its first {{TaskStatus}}.
> * A task can now live in multiple networks and can have multiple IP addresses.
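Both challenges can be illustrated with a small Python sketch: addresses are unknown until a status update arrives, and a single update may carry several of them. The dict layout mirrors the field names of the TaskStatus, ContainerStatus, NetworkInfo and IPAddress messages; `extract_ips` and `TaskIpRegistry` are hypothetical names and not part of Aurora or Mesos.

```python
from typing import Dict, List


def extract_ips(task_status: dict) -> List[str]:
    """Collect every IP address reported in a TaskStatus-shaped dict."""
    container_status = task_status.get("container_status", {})
    ips: List[str] = []
    for network_info in container_status.get("network_infos", []):
        for address in network_info.get("ip_addresses", []):
            ip = address.get("ip_address")
            if ip and ip not in ips:  # keep order, drop duplicates
                ips.append(ip)
    return ips


class TaskIpRegistry:
    """In-memory stand-in for a store; IPs are unknown until a status arrives."""

    def __init__(self) -> None:
        self._ips: Dict[str, List[str]] = {}

    def on_status_update(self, task_id: str, task_status: dict) -> None:
        ips = extract_ips(task_status)
        if ips:  # updates without addresses leave earlier data untouched
            self._ips[task_id] = ips

    def ips_for(self, task_id: str) -> List[str]:
        return list(self._ips.get(task_id, []))


registry = TaskIpRegistry()
status = {
    "state": "TASK_RUNNING",
    "container_status": {
        "network_infos": [
            {"name": "tenant-a", "ip_addresses": [{"ip_address": "10.0.0.5"}]},
            {"name": "monitoring", "ip_addresses": [{"ip_address": "10.1.0.9"}]},
        ]
    },
}
registry.on_status_update("task-1", status)
```

Anything that consumes a task's address has to tolerate the window between launch and the first status update, during which the lookup returns an empty list.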
> h2. Necessary Changes
> In order to implement CNI support in Aurora, several changes across the
> entire code base are needed.
> h3. Mesos
> * As of today, there seems to be no reliable way to discover
> CNI-assigned IPs from within an executor (see MESOS-6281). This is crucial
> for us, as Thermos is responsible for announcing itself in ZooKeeper
> serversets.
> h3. Thermos
> * The Observer UI needs to be updated to handle multiple IP addresses.
> * The ZK serverset announcement needs to be adjusted to publish all
> IP addresses.
> * A replacement/addition for pystachio {{{{mesos.hostname}}}} is required so
> that user code can discover its current IP addresses. This relates to
> MESOS-6281.
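One possible shape for the serverset change is to announce one member per assigned address. The Python sketch below assumes the usual serverset JSON layout (`serviceEndpoint`, `additionalEndpoints`, `status`, `shard`); whether multiple IPs should become separate members or extra endpoints on one member is an open design question, and `serverset_entries` is a hypothetical helper rather than existing Thermos code.

```python
import json
from typing import List


def serverset_entries(ips: List[str], port: int, shard: int) -> List[str]:
    """Build one serverset member payload (JSON string) per IP address."""
    entries = []
    for ip in ips:
        entry = {
            "serviceEndpoint": {"host": ip, "port": port},
            "additionalEndpoints": {},
            "status": "ALIVE",
            "shard": shard,
        }
        entries.append(json.dumps(entry, sort_keys=True))
    return entries


# Example with made-up addresses and port; each string would be written
# to its own ZooKeeper node by the announcer.
entries = serverset_entries(["10.0.0.5", "10.1.0.9"], port=31000, shard=0)
```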
> h3. Aurora Scheduler
> * Feature toggle allowing operators to enable/disable CNI support.
> * Plumbing of NetworkInfo name and labels touching Thrift API, storage, and
> task launch mechanism.
> * Extension of {{TaskStatusHandlerImpl}} and
> [{{StateManager}}|https://github.com/apache/aurora/blob/783baaefb9a814ca01fad78181fe3df3de5b34af/src/main/java/org/apache/aurora/scheduler/state/StateManager.java]
> storage layer to persist received IP addresses.
> h3. Aurora Client
> * Extension of the pystachio configuration so that user-defined jobs can join
> operator-enabled networks.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)