Hi folks,

Initial discussion of the Savanna incubation request started yesterday. The 
two major topics discussed were Heat integration and a “clustering library” 
[1].

To start with, let me give a brief overview of the key Savanna features:
1. Provisioning of the underlying OpenStack resources (compute, volume, 
network) required for a Hadoop cluster.
2. Hadoop cluster deployment and configuration.
3. Integration with different Hadoop distributions through a plugin mechanism 
with a single control plane for all of them. In the future this can be used to 
integrate with other data processing frameworks, for example, Twitter Storm.
4. Reliability and performance optimizations to ensure Hadoop cluster 
performance on top of OpenStack, such as enabling Swift to be used as the 
underlying HDFS and exposing Swift data locality information to the Hadoop 
scheduler (a configuration sketch follows this list).
5. A set of Elastic Data Processing features:
  * On-demand execution of Hadoop jobs
  * A pool of different external data sources, such as Swift, an external 
Hadoop cluster, NoSQL and traditional databases
  * Pig and Hive integration
6. An OpenStack Dashboard plugin for all of the above.
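
To give item 4 a concrete flavor, below is roughly what pointing Hadoop at 
Swift looks like on the Hadoop side. This is only an illustrative 
core-site.xml fragment in the style of the hadoop-openstack Swift filesystem 
driver; the property names, the “savanna” service name and the endpoint values 
are my assumptions, not the authoritative Savanna-generated configuration:

    <!-- Illustrative only: adjust names and values to your distribution. -->
    <property>
      <name>fs.swift.impl</name>
      <value>org.apache.hadoop.fs.swift.snative.SwiftNativeFileSystem</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.auth.url</name>
      <value>http://keystone:5000/v2.0/tokens</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.username</name>
      <value>hadoop</value>
    </property>
    <property>
      <name>fs.swift.service.savanna.tenant</name>
      <value>demo</value>
    </property>

Jobs would then address data as swift://<container>.savanna/<path>, and 
Savanna’s role is to generate and distribute this kind of configuration, 
together with the locality hints, automatically.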

I highly recommend viewing our screencast about the Savanna 0.2 release 
(mid-July) [2] to better understand Savanna’s functionality.

As you can see, resource provisioning is just one of the features, and its 
implementation details are not critical to the overall architecture. It 
performs only the first step of cluster setup. We had been considering Heat 
for a while, but ended up with direct API calls in favor of speed and 
simplicity. Going forward, Heat integration will be done by implementing the 
extension mechanism [3] and [4] as part of the Icehouse release.
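
For illustration, the “direct API calls” approach boils down to something like 
the following. This is a minimal sketch assuming python-novaclient’s legacy 
Client(version, username, password, project, auth_url) constructor; the 
credentials, image and flavor names are made up, and this is not the actual 
Savanna code path:

    from novaclient import client as nova_client

    # Assumed credentials and endpoint -- purely illustrative.
    nova = nova_client.Client('2', 'savanna', 'secret', 'demo',
                              'http://keystone:5000/v2.0')

    image = nova.images.find(name='fedora-hadoop')   # hypothetical image
    flavor = nova.flavors.find(name='m1.medium')

    # One call per cluster node; Savanna drives many of these, plus volume
    # and network setup, before handing the nodes off to a plugin.
    server = nova.servers.create(name='hadoop-master-001',
                                 image=image,
                                 flavor=flavor)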

The next part, Hadoop cluster configuration, is already extensible, and we 
have several plugins - Vanilla, Hortonworks Data Platform, and a Cloudera 
plugin that has been started as well. This allows us to unify management of 
different Hadoop distributions under a single control plane. The plugins are 
responsible for correct Hadoop ecosystem configuration on the already 
provisioned resources and use different Hadoop management tools, such as 
Ambari, to set up and configure all cluster services, so there are no actual 
provisioning configs on the Savanna side in this case. Savanna and its plugins 
encapsulate the knowledge of Hadoop internals and the default configuration 
for Hadoop services.
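
To make the plugin mechanism more tangible: a provisioning plugin essentially 
has to answer “which Hadoop versions can you deploy” and “how do you configure 
and start Hadoop on these already provisioned instances”. The following is a 
rough, hypothetical sketch of such an interface; the class and method names 
are illustrative, and the real plugin SPI in the Savanna docs is the 
authoritative contract:

    import abc


    class ProvisioningPluginBase(abc.ABC):
        """Illustrative sketch of a Savanna-style provisioning plugin SPI."""

        @abc.abstractmethod
        def get_versions(self):
            """Return the Hadoop versions this plugin is able to deploy."""

        @abc.abstractmethod
        def configure_cluster(self, cluster):
            """Configure Hadoop services on the already provisioned
            instances, e.g. by driving a management tool such as Ambari."""

        @abc.abstractmethod
        def start_cluster(self, cluster):
            """Start all configured Hadoop services on the cluster."""

Each distribution-specific plugin (Vanilla, HDP, Cloudera) then implements 
this contract, while Savanna itself stays distribution-agnostic.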



The next topic is “Cluster API”.

The concern that was raised is how to extract general clustering functionality 
into a common library. The topic of cluster provisioning and management is 
currently relevant for a number of projects within the OpenStack ecosystem: 
Savanna, Trove, TripleO, Heat and TaskFlow.

Still, each of these projects has its own understanding of what cluster 
provisioning is. The idea of extracting common functionality sounds 
reasonable, but the details still need to be worked out.

I’ll try to highlight the Savanna team’s current perspective on this question. 
The notion of “cluster management”, in my view, has several levels:
1. Resource provisioning and configuration (instances, networks, storage). 
Heat is the main tool here, possibly with additional support from underlying 
services. For example, the instance grouping API extension [5] in Nova would 
be very useful. (A minimal Heat template sketch follows this list.)
2. Distributed communication/task execution. There is a project in the 
OpenStack ecosystem with the mission to provide a framework for distributed 
task execution - TaskFlow [6]. It was started quite recently. In Savanna we 
are really looking forward to using more and more of its functionality in the 
I and J cycles as TaskFlow itself matures.
3. Higher-level clustering - management of the actual services running on top 
of the infrastructure, for example, configuring HDFS data nodes in Savanna or 
setting up a MySQL cluster with Percona or Galera in Trove. These operations 
are typically very specific to the project’s domain. As for Savanna 
specifically, we make heavy use of our knowledge of Hadoop internals to deploy 
and configure it properly.
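
As a sketch for level 1, a minimal Heat template describing just the bare 
resources of a small cluster might look roughly like the following. The image 
and flavor names are made up, and this deliberately ignores volumes, networks 
and everything Hadoop-specific, which would stay on the Savanna/plugin side:

    heat_template_version: 2013-05-23

    description: Skeleton of a two-node cluster (illustrative only)

    resources:
      hadoop_master:
        type: OS::Nova::Server
        properties:
          image: fedora-hadoop      # hypothetical image name
          flavor: m1.medium

      hadoop_worker_1:
        type: OS::Nova::Server
        properties:
          image: fedora-hadoop
          flavor: m1.medium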

The overall conclusion seems to be that it makes sense to enhance Heat’s 
capabilities and invest in TaskFlow development, leaving domain-specific 
operations to the individual projects.
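
For level 2, the kind of TaskFlow usage we have in mind looks roughly like the 
following. This is a minimal sketch assuming TaskFlow’s linear flow pattern; 
the task names are made up and this is not actual Savanna code:

    from taskflow import engines
    from taskflow import task
    from taskflow.patterns import linear_flow


    class ProvisionInstances(task.Task):
        def execute(self):
            # Would call Nova/Cinder/Neutron here.
            print("provisioning instances")


    class ConfigureHadoop(task.Task):
        def execute(self):
            # Would hand off to the selected provisioning plugin here.
            print("configuring Hadoop services")


    # Tasks run one after another; TaskFlow tracks state and gives us a
    # natural place to hook in retries and rollbacks later on.
    flow = linear_flow.Flow("launch-cluster")
    flow.add(ProvisionInstances(), ConfigureHadoop())
    engines.run(flow)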

I would also like to emphasize that Hadoop cluster management, including 
scaling support, is already implemented in Savanna.

With all this, I do believe Savanna fills an important gap in OpenStack by 
providing data processing capabilities in a cloud environment in general, with 
integration with the Hadoop ecosystem as the first concrete step.

The Hadoop ecosystem is huge on its own, and the integration will add 
significant value to the OpenStack community and users [7].


[1] http://eavesdrop.openstack.org/meetings/tc/2013/tc.2013-09-10-20.02.log.html
[2] http://www.youtube.com/watch?v=SrlHM0-q5zI
[3] https://blueprints.launchpad.net/savanna/+spec/infra-provisioning-extensions
[4] https://blueprints.launchpad.net/savanna/+spec/heat-backed-resources-provisioning
[5] https://blueprints.launchpad.net/nova/+spec/instance-group-api-extension
[6] https://launchpad.net/taskflow
[7] http://www.google.com/trends/explore?q=openstack%2Chadoop#q=openstack%2C%20hadoop&cmpt=q

Sincerely yours,
Sergey Lukjanov
Savanna Technical Lead
Mirantis Inc.


