Hi folks,

Initial discussion of the Savanna incubation request started yesterday. The 
two major topics discussed were Heat integration and a “clustering library” 
[1].

To start with, let me give a brief overview of the key Savanna features:
1. Provisioning of the underlying OpenStack resources (compute, volume, 
network) required for a Hadoop cluster.
2. Hadoop cluster deployment and configuration.
3. Integration with different Hadoop distributions through a plugin mechanism 
with a single control plane for all of them. In the future this can be used to 
integrate with other data processing frameworks, for example, Twitter Storm.
4. Reliability and performance optimizations to ensure Hadoop cluster 
performance on top of OpenStack, such as enabling Swift to be used as the 
underlying HDFS and exposing Swift data locality information to the Hadoop 
scheduler (a configuration sketch follows this list).
5. A set of Elastic Data Processing features:
  * On-demand execution of Hadoop jobs
  * A pool of different external data sources, such as Swift, an external 
Hadoop cluster, NoSQL and traditional databases
  * Pig and Hive integration
6. An OpenStack Dashboard plugin for all of the above.
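
To give item 4 a concrete flavor, below is roughly what pointing Hadoop at 
Swift looks like on the Hadoop side. This is only an illustrative 
core-site.xml fragment in the style of the hadoop-openstack Swift filesystem 
driver; the property names, the “savanna” service name and the endpoint values 
are my assumptions, not the authoritative Savanna-generated configuration:

    <!-- Illustrative only: adjust names and values to your distribution. -->
    <property>
      <name>fs.swift.impl</name>
      <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.auth.url</name>
      <value>http://keystone:5000/v2.0/tokens</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.username</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.tenant</name>
      <value>demo</value>
    </property>

Jobs would then address data as swift://<container>.savanna/<path>, and 
Savanna’s role is to generate and distribute this kind of configuration, 
together with the locality hints, automatically.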

I highly recommend viewing our screencast about the Savanna 0.2 release 
(mid-July) [2] to better understand Savanna’s functionality.

As you can see, resource provisioning is just one of the features, and its 
implementation details are not critical to the overall architecture. It 
performs only the first step of cluster setup. We had been considering Heat 
for a while, but ended up with direct API calls in favor of speed and 
simplicity. Going forward, Heat integration will be done by implementing the 
extension mechanism [3] and [4] as part of the Icehouse release.
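
For illustration, the “direct API calls” approach boils down to something like 
the following. This is a minimal sketch assuming python-novaclient’s legacy 
Client(version, username, password, project, auth_url) constructor; the 
credentials, image and flavor names are made up, and this is not the actual 
Savanna code path:

    from novaclient import client as nova_client

    # Assumed credentials and endpoint -- purely illustrative.
    nova = nova_client.Client('2', 'savanna', 'secret', 'demo',
                              'http://keystone:5000/v2.0')

    image = nova.images.find(name='fedora-hadoop')   # hypothetical image
    flavor = nova.flavors.find(name='m1.medium')

    # One call per cluster node; Savanna drives many of these, plus volume
    # and network setup, before handing the nodes off to a plugin.
    server = nova.servers.create(name='hadoop-master-001',
                                 image=image,
                                 flavor=flavor)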

The next part, Hadoop cluster configuration, is already extensible, and we 
have several plugins - Vanilla, Hortonworks Data Platform, and a Cloudera 
plugin that has been started as well. This allows us to unify management of 
different Hadoop distributions under a single control plane. The plugins are 
responsible for correct Hadoop ecosystem configuration on the already 
provisioned resources and use different Hadoop management tools, such as 
Ambari, to set up and configure all cluster services, so there are no actual 
provisioning configs on the Savanna side in this case. Savanna and its plugins 
encapsulate the knowledge of Hadoop internals and the default configuration 
for Hadoop services.
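
To make the plugin mechanism more tangible: a provisioning plugin essentially 
has to answer “which Hadoop versions can you deploy” and “how do you configure 
and start Hadoop on these already provisioned instances”. The following is a 
rough, hypothetical sketch of such an interface; the class and method names 
are illustrative, and the real plugin SPI in the Savanna docs is the 
authoritative contract:

    import abc


    class ProvisioningPluginBase(abc.ABC):
        """Illustrative sketch of a Savanna-style provisioning plugin SPI."""

        @abc.abstractmethod
        def get_versions(self):
            """Return the Hadoop versions this plugin is able to deploy."""

        @abc.abstractmethod
        def configure_cluster(self, cluster):
            """Configure Hadoop services on the already provisioned
            instances, e.g. by driving a management tool such as Ambari."""

        @abc.abstractmethod
        def start_cluster(self, cluster):
            """Start all configured Hadoop services on the cluster."""

Each distribution-specific plugin (Vanilla, HDP, Cloudera) then implements 
this contract, while Savanna itself stays distribution-agnostic.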



The next topic is “Cluster API”.

The concern that was raised is how to extract general clustering functionality 
into a common library. The topic of cluster provisioning and management is 
currently relevant for a number of projects within the OpenStack ecosystem: 
Savanna, Trove, TripleO, Heat and TaskFlow.

Still, each of these projects has its own understanding of what cluster 
provisioning is. The idea of extracting common functionality sounds 
reasonable, but the details still need to be worked out.

I’ll try to highlight the Savanna team’s current perspective on this question. 
The notion of “cluster management”, in my view, has several levels:
1. Resource provisioning and configuration (instances, networks, storage). 
Heat is the main tool here, possibly with additional support from underlying 
services. For example, the instance grouping API extension [5] in Nova would 
be very useful. (A minimal Heat template sketch follows this list.)
2. Distributed communication/task execution. There is a project in the 
OpenStack ecosystem with the mission to provide a framework for distributed 
task execution - TaskFlow [6]. It was started quite recently. In Savanna we 
are really looking forward to using more and more of its functionality in the 
I and J cycles as TaskFlow itself matures.
3. Higher-level clustering - management of the actual services running on top 
of the infrastructure, for example, configuring HDFS data nodes in Savanna or 
setting up a MySQL cluster with Percona or Galera in Trove. These operations 
are typically very specific to the project’s domain. As for Savanna 
specifically, we make heavy use of our knowledge of Hadoop internals to deploy 
and configure it properly.
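
As a sketch for level 1, a minimal Heat template describing just the bare 
resources of a small cluster might look roughly like the following. The image 
and flavor names are made up, and this deliberately ignores volumes, networks 
and everything Hadoop-specific, which would stay on the Savanna/plugin side:

    heat_template_version: 2013-05-23

    description: Skeleton of a two-node cluster (illustrative only)

    resources:
      hadoop_master:
        type: OS::Nova::Server
        properties:
          image: fedora-hadoop      # hypothetical image name
          flavor: m1.medium

      hadoop_worker_1:
        type: OS::Nova::Server
        properties:
          image: fedora-hadoop
          flavor: m1.medium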

The overall conclusion seems to be that it makes sense to enhance Heat’s 
capabilities and invest in TaskFlow development, leaving domain-specific 
operations to the individual projects.
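
For level 2, the kind of TaskFlow usage we have in mind looks roughly like the 
following. This is a minimal sketch assuming TaskFlow’s linear flow pattern; 
the task names are made up and this is not actual Savanna code:

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow


    class ProvisionInstances(task.Task):
        def execute(self):
            # Would call Nova/Cinder/Neutron here.
            print("provisioning instances")


    class ConfigureHadoop(task.Task):
        def execute(self):
            # Would hand off to the selected provisioning plugin here.
            print("configuring Hadoop services")


    # Tasks run one after another; TaskFlow tracks state and gives us a
    # natural place to hook in retries and rollbacks later on.
    flow = linear_flow.Flow("launch-cluster")
    flow.add(ProvisionInstances(), ConfigureHadoop())
    engines.run(flow)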

I would also like to emphasize that Hadoop cluster management, including 
scaling support, is already implemented in Savanna.

With all this, I do believe Savanna fills an important gap in OpenStack by 
providing data processing capabilities in a cloud environment in general, with 
integration with the Hadoop ecosystem as the first concrete step.

The Hadoop ecosystem is huge on its own, and the integration will add 
significant value to the OpenStack community and users [7].


[1] http://eavesdrop.openstack.org/meetings/tc/2013/tc.2013-09-10-20.02.log.html
[2] http://www.youtube.com/watch?v=SrlHM0-q5zI
[3] https://blueprints.launchpad.net/savanna/+spec/infra-provisioning-extensions
[4] https://blueprints.launchpad.net/savanna/+spec/heat-backed-resources-provisioning
[5] https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension
[6] https://launchpad.net/taskflow
[7] http://www.google.com/trends/explore?q=openstack%2Chadoop#q=openstack%2C%20hadoop&cmpt=q

Sincerely yours,
Sergey Lukjanov
Savanna Technical Lead
Mirantis Inc.


