Updated Branches:
  refs/heads/master 4054a8e2a -> 2238e2bbc

[GSOC] Meng's mid-term report

Signed-off-by: Sebastien Goasguen <run...@gmail.com>


Project: http://git-wip-us.apache.org/repos/asf/cloudstack/repo
Commit: http://git-wip-us.apache.org/repos/asf/cloudstack/commit/2238e2bb
Tree: http://git-wip-us.apache.org/repos/asf/cloudstack/tree/2238e2bb
Diff: http://git-wip-us.apache.org/repos/asf/cloudstack/diff/2238e2bb

Branch: refs/heads/master
Commit: 2238e2bbc5863ce926341b2d91fd4bc8bca1c873
Parents: 4054a8e
Author: Meng Han <meng...@ufl.edu>
Authored: Mon Jul 29 11:26:17 2013 -0400
Committer: Sebastien Goasguen <run...@gmail.com>
Committed: Mon Jul 29 11:26:17 2013 -0400

----------------------------------------------------------------------
 docs/en-US/gsoc-midsummer-meng.xml           | 196 +++++++++++++++++++++-
 docs/en-US/images/clusterDefinition.png      | Bin 0 -> 52607 bytes
 docs/en-US/images/launchHadoopClusterApi.png | Bin 0 -> 13427 bytes
 docs/en-US/images/launchHadoopClusterCmd.png | Bin 0 -> 83972 bytes
 docs/en-US/images/whirrDependency.png        | Bin 0 -> 10794 bytes
 docs/en-US/images/whirrOutput.png            | Bin 0 -> 61831 bytes
 6 files changed, 192 insertions(+), 4 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/gsoc-midsummer-meng.xml
----------------------------------------------------------------------
diff --git a/docs/en-US/gsoc-midsummer-meng.xml 
b/docs/en-US/gsoc-midsummer-meng.xml
index 1ab07cb..ee24cf4 100644
--- a/docs/en-US/gsoc-midsummer-meng.xml
+++ b/docs/en-US/gsoc-midsummer-meng.xml
@@ -11,9 +11,9 @@
  to you under the Apache License, Version 2.0 (the
  "License"); you may not use this file except in compliance
  with the License.  You may obtain a copy of the License at
- 
+
    http://www.apache.org/licenses/LICENSE-2.0
- 
+
  Unless required by applicable law or agreed to in writing,
  software distributed under the License is distributed on an
  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
@@ -23,6 +23,194 @@
 -->
 
 <section id="gsoc-midsummer-meng">
-    <title>Mid-Summer Progress Updates</title>
-    <para>This section describes ...</para>
+    <title>Mid-Summer Progress Updates for Meng - "Hadoop Provisioning on 
Cloudstack Via Whirr"</title>
+       <para></para>
+    <para>In this section I describe my progress with the project titled 
"Hadoop Provisioning on CloudStack Via Whirr"</para>
+    <section id="introduction-meng">
+        <title>Introduction</title>
+        <para>
+            It has been five weeks since the GSOC 2013 is kick-started. During 
the last five weeks I have been constantly learning from the CloudStack 
Community in aspects of both knowledge and personality. The whole community is 
very accommodating and willing to help newbies. I am making progress steadily 
with the community's help. This is my first experience working with such a 
large and cool code base, definitely a challenging and wonderful experience for 
me. Though I am a little slipped behind my schedule, I am making my best effort 
and hoping to complete what I set out in my proposal by the end of this summer.
+        </para>
+        <para>
+            
+        </para>
+    </section>
+<section id="cloudstack-installation-meng">
+        <title>CloudStack Installation</title>
+        <para>
+           I spent two weeks or so on the CloudStack Installation. In the 
beginning, I am using the Ubuntu systems. Given that I am not familiar with 
maven and a little scared by various kinds of errors and exceptions during 
system deployment, I failed to deploy CloudStack through building from the 
source. With Ian's advice, I switched to CentOS and began to use rpm packages 
for installation, things went much smoother. By the end of the second week, I 
submitted my first patch -- CloudStack_4.1_Quick_Install_Guide.
+        </para>
+       
+    </section>
+    <section id="whirrr-installation">
+        <title>Deploying a Hadoop Cluster on CloudStack via Whirr</title>
+        
+         <para>   Provided that CloudStack is in place and I can register 
templates and add instances, I went ahead to use Whirr to deploy a hadoop 
cluster on CloudStack. The cluster definition file is as follows:</para>
+ <mediaobject>
+            <imageobject>
+                <imagedata fileref="./images/clusterDefinition.png"/>
+            </imageobject>
+           
+        </mediaobject>
+
+<para> <emphasis role="bold">whirr.cluster-name:     </emphasis>the name of 
your hadoop cluster.</para>
+<para> <emphasis role="bold">whirr.store-cluster-in-etc-hosts:     
</emphasis>store all cluster IPs and hostnames in /etc/hosts on each 
node.</para>
+<para> <emphasis role="bold">         whirr.instance-templates:     
</emphasis> this specifies your cluster layout. One node acts as the jobtracker 
and namenode (the hadoop master). Another two slaves nodes act as both datanode 
and tasktracker.</para>
+<para> <emphasis role="bold">   image-id:           </emphasis> This tells 
CloudStack which template to use to start the cluster. </para>
+<para> <emphasis role="bold">         hardware-id:     </emphasis> This is the 
type of hardware to use for the cluster instances. 
+</para>
+<para> <emphasis role="bold">         private/public-key-file:     
</emphasis>:the key-pair used to login to each instance. Only RSA SSH keys are 
supported at this moment. Jclouds will move this key pair to the set of 
instances on startup. </para>
+<para> <emphasis role="bold">         whirr.cluster-user:     </emphasis>this 
is the name of the cluster admin user.</para>
+<para> <emphasis role="bold">         whirr.bootstrap-user:     
</emphasis>this tells Jclouds which user name and password to use to login to 
each instance for bootstrapping and customizing each instance. You must specify 
this property if the image you choose has a hardwired username/password.(e.g. 
the default template CentOS 5.5(64-bit) no GUI (KVM) comes with Cloudstack has 
a hardcoded credential: root:password), otherwise you don't need to specify 
this property.</para>
+<para> <emphasis role="bold">         whirr.env.repo:     </emphasis>this 
tells Whirr which repository to use to download packages.</para>
+<para> <emphasis role="bold">         
whirr.hadoop.install-function/whirr.hadoop.configure-function     </emphasis>  
:it's self-explanatory.</para>
+
+        
+      
+        <para>
+            Output of this deployment is as follows:
+        </para>
+ <mediaobject>
+            <imageobject>
+                <imagedata fileref="./images/whirrOutput.png"/>
+            </imageobject>
+          
+        </mediaobject>
+       
+        <para>
+             Other details can be found at  <ulink 
url="http://kyrameng.blogspot.com/2013/07/how-to-use-whirr-to-start-hadoop.html";><citetitle>this
 post</citetitle></ulink> in my blog. In addition I have a Whirr trouble 
shooting <ulink 
url="http://kyrameng.blogspot.com/2013/07/whirr-trouble-shooting.html";><citetitle>post</citetitle></ulink>
 there  if you are interested.
+        </para>
+    </section>
+    <section id="emr-plugin-implementation">
+        <title>Elastic Map Reduce(EMR) Plugin Implementation</title>
+        <para>
+           Given that I have completed the deployment of a hadoop cluster on 
CloudStack using Whirr through the above steps, I began to dive into the EMR 
plugin development. My first API is launchHadoopCluster, it's implementation is 
quite straight forward, by invoking an external Whirr command in the command 
line on the management server and piggybacking the Whirr output in 
responses.This api has a structure like below:    </para>
+       <mediaobject>
+            <imageobject>
+                <imagedata fileref="./images/launchHadoopClusterApi.png"/>
+            </imageobject>       
+        </mediaobject>
+<para>The following is the source code of launchHadoopClusterCmd.java.</para>
+<mediaobject>
+            <imageobject>
+                <imagedata fileref="./images/launchHadoopClusterCmd.png"/>
+            </imageobject>
+          
+        </mediaobject>
+ <para>You can invoke this api through the following command in 
CloudMonkey:</para>
+ <programlisting>> launchHadoopCluster 
config=myhadoop.properties</programlisting>
+<para>  </para>
+<para>This is sort of the launchHadoopCluster 0.0, other details can be found 
in this <ulink 
url="http://kyrameng.blogspot.com/2013/07/cloudstack-emr-api-developement-series.html";><citetitle>post</citetitle></ulink>
 .</para>
+<para>
+My undergoing working is modifying this api so that it calls Whirr libraries 
instead of invoking Whirr externally in the command line.</para>
+<para>First add Whirr as a dependency of this plugin so that maven will 
download Whirr automatically when you compile this plugin.</para>
+<mediaobject>
+            <imageobject>
+                <imagedata fileref="./images/whirrDependency.png"/>
+            </imageobject>
+          
+        </mediaobject>
+
+<para>I am planning to replace the Runtime.getRuntime().exec() above  with the 
following code snippet.</para>
+<programlisting language="Java">
+ LaunchClusterCommand command = new LaunchClusterCommand();
+ command.run(System.in, System.out, System.err, Arrays.asList(args));
+</programlisting>
+<para></para>
+<para>Eventually when a hadoop cluster is launched. We can use Yarn to submit 
hadoop jobs. 
+Yarn exposes the following API for job submission.</para>
+<programlisting>ApplicationId submitApplication(ApplicationSubmissionContext 
appContext) throws 
org.apache.hadoop.yarn.exceptions.YarnRemoteException</programlisting>
+<para>In Yarn, an application is either a single job in the classical sense of 
Map-Reduce or a DAG of jobs. In other words an application can have many jobs. 
This fits well with the concepts in EMR design. The term job flow  in EMR is 
equivalent to the application concept in Yarn. Correspondingly, a job flow step 
in EMR is equal to a job in Yarn. In addition Yarn exposes the following API to 
query the state of an application.</para>
+<programlisting>ApplicationReport getApplicationReport(ApplicationId appId) 
throws org.apache.hadoop.yarn.exceptions.YarnRemoteException</programlisting>
+<para>The above API can be used to implement the DescribeJobFlows API in EMR. 
</para>
+
+    
+              
+      
+    </section>
+  <section id="learning-jclouds">
+  <title>Learning Jclouds</title>
+<para>As Whirr relies on Jclouds for clouds provisioning, it's important for 
me to understand what Jclouds features support Whirr and how Whirr interacts 
with Jclouds. I figured out the following problems:</para>
+<itemizedlist>
+<listitem><para>How does Whirr create user credentials on each node? </para>
+<para>
+Using the runScript feature provide by Jclouds, Whirr can execute a script at 
node bootup, one of the options in the script is to override the login 
credentials with the ones that provide in the cluster properties file. The 
following line from Whirr demonstrates this idea.
+<programlisting language="Java">final RunScriptOptions options = 
overrideLoginCredentials(LoginCredentials.builder().user(clusterSpec.getClusterUser()).privateKey(clusterSpec.getPrivateKey()).build());
+</programlisting>
+</para><para> </para>
+</listitem>
+<listitem><para>How does Whirr start up instances in the beginning? </para>
+<para>The computeService APIs provided by jclouds allow Whirr to create a set 
of nodes in a group(specified by the cluster name),and operate them as a 
logical unit without worrying about the implementation details of the cloud. 
</para>
+<programlisting language="Java">Set&lt;NodeMetadata&gt; nodes = 
(Set&lt;NodeMetadata&gt;)computeService.createNodesInGroup(clusterName, num, 
template);
+</programlisting><para> </para><para>The above command returns all the nodes 
the API was able to launch into in a running state with port 22 
open.</para></listitem>
+<listitem><para>How does Whirr differentiate nodes by roles and configure them 
separately? </para>
+<para>Jclouds commands ending in Matching are called predicate commands. They 
allow Whirr to decide which subset of nodes these commands will affect. For 
example, the following command in Whirr will run a script with specified 
options on nodes who match the given condition.</para>
+<programlisting language="Java"> 
+Predicate&lt;NodeMetadata&gt; condition;
+condition = Predicates.and(runningInGroup(spec.getClusterName()), condition);
+ComputeServiceContext context = getCompute().apply(spec);
+context.getComputeService().runScriptOnNodesMatching(condition,statement, 
options);
+</programlisting>
+<para>The following is an example how a node playing the role of jobtracker in 
a hadoop cluster is configured to open certain ports using the predicate 
commands.</para>
+<programlisting language="Java">
+    Instance jobtracker = cluster.getInstanceMatching(role(ROLE)); //  
ROLE="hadoop-jobtracker"
+    event.getFirewallManager().addRules(
+        Rule.create()
+          .destination(jobtracker)
+          .ports(HadoopCluster.JOBTRACKER_WEB_UI_PORT),
+        Rule.create()
+          
.source(HadoopCluster.getNamenodePublicAddress(cluster).getHostAddress())
+          .destination(jobtracker)
+          .ports(HadoopCluster.JOBTRACKER_PORT)
+    );
+
+</programlisting>
+<para> </para>
+<para>With the help of such predicated commands, Whirr can run different 
bootstrap and init scripts on nodes with distinct roles.</para>
+
+</listitem>
+
+
+</itemizedlist>
+
+</section>
+    <section id="Lessons">
+        <title>Great Lessons Learned</title>
+        <para>
+            I am much appreciated with the opportunity to work with CloudStack 
and learn from the lovable community. I can see myself constantly evolving from 
this invaluable experience both technologically and psychologically. There were 
hard times that I were stuck on certain problems for days and good times that 
made me want to scream seeing problem cleared. This project is a great 
challenge for me. I am making progress steadily though not smoothly. That's 
where I learned the following great lessons:
+        
+    
+        </para>
+        <itemizedlist>
+            <listitem>
+                <para>When you work in an open source community, do things in 
the open source way. There was a time when I locked myself up because I am 
stuck on problems and I am not confident enough to ask them on the mailing 
list. The more I restricted myself from the community the less progress I made. 
Also the lack of communication from my side also prevents me from learning from 
other people and get guidance from my mentor.</para>
+            </listitem>
+            <listitem>
+                <para>CloudStack is evolving at a fast pace. There are many 
APIs being added ,many patches being submitted every day. That's why the 
community use the word "SNAPSHOT" for each version. At this moment I am 
learning to deal with fast code changing and upgrading. A large portion of my 
time is devoted to system installation and deployment. I am getting used to 
treat system exceptions and errors as a common case. That's another reason why 
communication with the community is critical. </para>
+
+            </listitem>
+
+            <listitem>
+                <para>In addition to the project itself, I am strengthening my 
technical suite at the same time. </para>
+
+<itemizedlist>
+<listitem><para>I learned to use some useful software tools: maven, git, 
publican, etc.</para></listitem>
+            <listitem>
+                <para>
+Reading the source code of Whirr make me learn more high level java 
programming skills, e.g. using generics, wildcard, service loader, the Executor 
model, Future object, etc .</para>
+            </listitem>
+            <listitem>
+                <para>I am exposed to Jclouds, a useful cloud neutral library 
to manipulate different cloud infrastructures.</para>
+            </listitem>
+         <listitem><para>I gained deeper understanding of cloud web services 
and learned the usage of several cloud clients, e.g. Jclouds CLI, 
CloudMonkey,etc.</para></listitem>
+        </itemizedlist>
+    
+
+            </listitem>
+          
+ 
+        </itemizedlist>
+             
+        <para>I am grateful that Google Summer Of Code exists, it gives us 
students a sense of how fast real-world software development works and provides 
us hand-on experience of coding in large open source projects. More importantly 
it's a self-challenging process that strengthens our minds along the way.</para>
+    </section>
 </section>

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/clusterDefinition.png
----------------------------------------------------------------------
diff --git a/docs/en-US/images/clusterDefinition.png 
b/docs/en-US/images/clusterDefinition.png
new file mode 100644
index 0000000..6170f9f
Binary files /dev/null and b/docs/en-US/images/clusterDefinition.png differ

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/launchHadoopClusterApi.png
----------------------------------------------------------------------
diff --git a/docs/en-US/images/launchHadoopClusterApi.png 
b/docs/en-US/images/launchHadoopClusterApi.png
new file mode 100644
index 0000000..6f94c74
Binary files /dev/null and b/docs/en-US/images/launchHadoopClusterApi.png differ

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/launchHadoopClusterCmd.png
----------------------------------------------------------------------
diff --git a/docs/en-US/images/launchHadoopClusterCmd.png 
b/docs/en-US/images/launchHadoopClusterCmd.png
new file mode 100644
index 0000000..66a0c75
Binary files /dev/null and b/docs/en-US/images/launchHadoopClusterCmd.png differ

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/whirrDependency.png
----------------------------------------------------------------------
diff --git a/docs/en-US/images/whirrDependency.png 
b/docs/en-US/images/whirrDependency.png
new file mode 100644
index 0000000..acdec78
Binary files /dev/null and b/docs/en-US/images/whirrDependency.png differ

http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/whirrOutput.png
----------------------------------------------------------------------
diff --git a/docs/en-US/images/whirrOutput.png 
b/docs/en-US/images/whirrOutput.png
new file mode 100644
index 0000000..7c3b512
Binary files /dev/null and b/docs/en-US/images/whirrOutput.png differ

Reply via email to