Updated Branches: refs/heads/master 4054a8e2a -> 2238e2bbc
[GSOC] Meng's mid-term report Signed-off-by: Sebastien Goasguen <run...@gmail.com> Project: http://git-wip-us.apache.org/repos/asf/cloudstack/repo Commit: http://git-wip-us.apache.org/repos/asf/cloudstack/commit/2238e2bb Tree: http://git-wip-us.apache.org/repos/asf/cloudstack/tree/2238e2bb Diff: http://git-wip-us.apache.org/repos/asf/cloudstack/diff/2238e2bb Branch: refs/heads/master Commit: 2238e2bbc5863ce926341b2d91fd4bc8bca1c873 Parents: 4054a8e Author: Meng Han <meng...@ufl.edu> Authored: Mon Jul 29 11:26:17 2013 -0400 Committer: Sebastien Goasguen <run...@gmail.com> Committed: Mon Jul 29 11:26:17 2013 -0400 ---------------------------------------------------------------------- docs/en-US/gsoc-midsummer-meng.xml | 196 +++++++++++++++++++++- docs/en-US/images/clusterDefinition.png | Bin 0 -> 52607 bytes docs/en-US/images/launchHadoopClusterApi.png | Bin 0 -> 13427 bytes docs/en-US/images/launchHadoopClusterCmd.png | Bin 0 -> 83972 bytes docs/en-US/images/whirrDependency.png | Bin 0 -> 10794 bytes docs/en-US/images/whirrOutput.png | Bin 0 -> 61831 bytes 6 files changed, 192 insertions(+), 4 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/gsoc-midsummer-meng.xml ---------------------------------------------------------------------- diff --git a/docs/en-US/gsoc-midsummer-meng.xml b/docs/en-US/gsoc-midsummer-meng.xml index 1ab07cb..ee24cf4 100644 --- a/docs/en-US/gsoc-midsummer-meng.xml +++ b/docs/en-US/gsoc-midsummer-meng.xml @@ -11,9 +11,9 @@ to you under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. 
You may obtain a copy of the License at - + http://www.apache.org/licenses/LICENSE-2.0 - + Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY @@ -23,6 +23,194 @@ --> <section id="gsoc-midsummer-meng"> - <title>Mid-Summer Progress Updates</title> - <para>This section describes ...</para> + <title>Mid-Summer Progress Updates for Meng - "Hadoop Provisioning on CloudStack Via Whirr"</title> + <para></para> + <para>In this section I describe my progress with the project titled "Hadoop Provisioning on CloudStack Via Whirr".</para> + <section id="introduction-meng"> + <title>Introduction</title> + <para> + It has been five weeks since GSoC 2013 kicked off. During these five weeks I have been constantly learning from the CloudStack community, both technically and personally. The whole community is very accommodating and willing to help newbies. I am making steady progress with the community's help. This is my first experience working with such a large and cool code base, and it is definitely a challenging and wonderful experience for me. Although I have slipped a little behind schedule, I am doing my best and hope to complete what I set out in my proposal by the end of this summer. + </para> + <para> + + </para> + </section> +<section id="cloudstack-installation-meng"> + <title>CloudStack Installation</title> + <para> + I spent two weeks or so on the CloudStack installation. In the beginning, I was using Ubuntu. Because I was not familiar with Maven and was a little intimidated by the various errors and exceptions during system deployment, I failed to deploy CloudStack by building from source. On Ian's advice, I switched to CentOS and began to use rpm packages for installation, and things went much more smoothly. By the end of the second week, I submitted my first patch -- CloudStack_4.1_Quick_Install_Guide. 
+ </para> + + </section> + <section id="whirrr-installation"> + <title>Deploying a Hadoop Cluster on CloudStack via Whirr</title> + + <para> Once CloudStack was in place and I could register templates and add instances, I went ahead and used Whirr to deploy a Hadoop cluster on CloudStack. The cluster definition file is as follows:</para> + <mediaobject> + <imageobject> + <imagedata fileref="./images/clusterDefinition.png"/> + </imageobject> + + </mediaobject> + +<para> <emphasis role="bold">whirr.cluster-name: </emphasis>the name of your Hadoop cluster.</para> +<para> <emphasis role="bold">whirr.store-cluster-in-etc-hosts: </emphasis>store all cluster IPs and hostnames in /etc/hosts on each node.</para> +<para> <emphasis role="bold"> whirr.instance-templates: </emphasis> this specifies your cluster layout. One node acts as the jobtracker and namenode (the Hadoop master). Two other slave nodes act as both datanode and tasktracker.</para> +<para> <emphasis role="bold"> image-id: </emphasis> This tells CloudStack which template to use to start the cluster. </para> +<para> <emphasis role="bold"> hardware-id: </emphasis> This is the type of hardware to use for the cluster instances. +</para> +<para> <emphasis role="bold"> private/public-key-file: </emphasis>the key pair used to log in to each instance. Only RSA SSH keys are supported at the moment. Jclouds will copy this key pair to the set of instances on startup. </para> +<para> <emphasis role="bold"> whirr.cluster-user: </emphasis>this is the name of the cluster admin user.</para> +<para> <emphasis role="bold"> whirr.bootstrap-user: </emphasis>this tells Jclouds which user name and password to use to log in to each instance for bootstrapping and customization. You must specify this property if the image you choose has a hardwired username/password (e.g. 
the default template CentOS 5.5(64-bit) no GUI (KVM) that comes with CloudStack has a hardcoded credential, root:password); otherwise you don't need to specify this property.</para> +<para> <emphasis role="bold"> whirr.env.repo: </emphasis>this tells Whirr which repository to use to download packages.</para> +<para> <emphasis role="bold"> whirr.hadoop.install-function/whirr.hadoop.configure-function </emphasis>these name the functions used to install and configure Hadoop.</para> + + + + <para> + The output of this deployment is as follows: + </para> + <mediaobject> + <imageobject> + <imagedata fileref="./images/whirrOutput.png"/> + </imageobject> + + </mediaobject> + + <para> + Other details can be found in <ulink url="http://kyrameng.blogspot.com/2013/07/how-to-use-whirr-to-start-hadoop.html"><citetitle>this post</citetitle></ulink> on my blog. In addition, I have a Whirr troubleshooting <ulink url="http://kyrameng.blogspot.com/2013/07/whirr-trouble-shooting.html"><citetitle>post</citetitle></ulink> there if you are interested. + </para> + </section> + <section id="emr-plugin-implementation"> + <title>Elastic MapReduce (EMR) Plugin Implementation</title> + <para> + Having deployed a Hadoop cluster on CloudStack using Whirr through the above steps, I began to dive into the EMR plugin development. 
My first API is launchHadoopCluster. Its implementation is quite straightforward: it invokes an external Whirr command on the command line of the management server and piggybacks the Whirr output in the response. This API has the structure shown below: </para> + <mediaobject> + <imageobject> + <imagedata fileref="./images/launchHadoopClusterApi.png"/> + </imageobject> + </mediaobject> +<para>The following is the source code of launchHadoopClusterCmd.java.</para> +<mediaobject> + <imageobject> + <imagedata fileref="./images/launchHadoopClusterCmd.png"/> + </imageobject> + + </mediaobject> + <para>You can invoke this API through the following command in CloudMonkey:</para> + <programlisting>> launchHadoopCluster config=myhadoop.properties</programlisting> +<para> </para> +<para>This is a sort of launchHadoopCluster 0.0; other details can be found in this <ulink url="http://kyrameng.blogspot.com/2013/07/cloudstack-emr-api-developement-series.html"><citetitle>post</citetitle></ulink>.</para> +<para> +My ongoing work is modifying this API so that it calls the Whirr libraries instead of invoking Whirr externally on the command line.</para> +<para>First, add Whirr as a dependency of this plugin so that Maven will download Whirr automatically when you compile the plugin.</para> +<mediaobject> + <imageobject> + <imagedata fileref="./images/whirrDependency.png"/> + </imageobject> + + </mediaobject> + +<para>I am planning to replace the Runtime.getRuntime().exec() above with the following code snippet.</para> +<programlisting language="Java"> + LaunchClusterCommand command = new LaunchClusterCommand(); + command.run(System.in, System.out, System.err, Arrays.asList(args)); +</programlisting> +<para></para> +<para>Eventually, once a Hadoop cluster is launched, we can use YARN to submit Hadoop jobs. 
+YARN exposes the following API for job submission.</para> +<programlisting>ApplicationId submitApplication(ApplicationSubmissionContext appContext) throws org.apache.hadoop.yarn.exceptions.YarnRemoteException</programlisting> +<para>In YARN, an application is either a single job in the classical sense of MapReduce or a DAG of jobs. In other words, an application can have many jobs. This fits well with the concepts in the EMR design. The term job flow in EMR is equivalent to the application concept in YARN. Correspondingly, a job flow step in EMR corresponds to a job in YARN. In addition, YARN exposes the following API to query the state of an application.</para> +<programlisting>ApplicationReport getApplicationReport(ApplicationId appId) throws org.apache.hadoop.yarn.exceptions.YarnRemoteException</programlisting> +<para>The above API can be used to implement the DescribeJobFlows API in EMR. </para> + + + + + </section> + <section id="learning-jclouds"> + <title>Learning Jclouds</title> +<para>As Whirr relies on Jclouds for cloud provisioning, it's important for me to understand which Jclouds features Whirr depends on and how Whirr interacts with Jclouds. I worked out answers to the following questions:</para> +<itemizedlist> +<listitem><para>How does Whirr create user credentials on each node? </para> +<para> +Using the runScript feature provided by Jclouds, Whirr can execute a script at node bootup. One of the options for running the script is to override the login credentials with the ones provided in the cluster properties file. The following line from Whirr demonstrates this idea. +<programlisting language="Java">final RunScriptOptions options = overrideLoginCredentials(LoginCredentials.builder().user(clusterSpec.getClusterUser()).privateKey(clusterSpec.getPrivateKey()).build()); +</programlisting> +</para><para> </para> +</listitem> +<listitem><para>How does Whirr start up instances in the beginning? 
</para> +<para>The computeService APIs provided by Jclouds allow Whirr to create a set of nodes in a group (specified by the cluster name) and operate them as a logical unit without worrying about the implementation details of the cloud. </para> +<programlisting language="Java">Set<NodeMetadata> nodes = (Set<NodeMetadata>)computeService.createNodesInGroup(clusterName, num, template); +</programlisting><para> </para><para>The above command returns all the nodes the API was able to launch into a running state with port 22 open.</para></listitem> +<listitem><para>How does Whirr differentiate nodes by roles and configure them separately? </para> +<para>Jclouds commands ending in Matching are called predicate commands. They allow Whirr to decide which subset of nodes these commands will affect. For example, the following command in Whirr will run a script with the specified options on nodes that match the given condition.</para> +<programlisting language="Java"> +Predicate<NodeMetadata> condition; +condition = Predicates.and(runningInGroup(spec.getClusterName()), condition); +ComputeServiceContext context = getCompute().apply(spec); +context.getComputeService().runScriptOnNodesMatching(condition,statement, options); +</programlisting> +<para>The following is an example of how a node playing the role of jobtracker in a Hadoop cluster is configured to open certain ports using the predicate commands.</para> +<programlisting language="Java"> + Instance jobtracker = cluster.getInstanceMatching(role(ROLE)); // ROLE="hadoop-jobtracker" + event.getFirewallManager().addRules( + Rule.create() + .destination(jobtracker) + .ports(HadoopCluster.JOBTRACKER_WEB_UI_PORT), + Rule.create() + .source(HadoopCluster.getNamenodePublicAddress(cluster).getHostAddress()) + .destination(jobtracker) + .ports(HadoopCluster.JOBTRACKER_PORT) + ); + +</programlisting> +<para> </para> +<para>With the help of such predicate commands, Whirr can run different bootstrap and init scripts on nodes with distinct
roles.</para> + +</listitem> + + +</itemizedlist> + +</section> + <section id="Lessons"> + <title>Great Lessons Learned</title> + <para> + I greatly appreciate the opportunity to work with CloudStack and learn from the lovable community. I can see myself constantly growing from this invaluable experience, both technically and psychologically. There were hard times when I was stuck on certain problems for days, and good times when I wanted to scream on seeing a problem cleared. This project is a great challenge for me. I am making progress steadily, though not smoothly. Along the way I learned the following great lessons: + + + </para> + <itemizedlist> + <listitem> + <para>When you work in an open source community, do things the open source way. There was a time when I locked myself up because I was stuck on problems and was not confident enough to ask about them on the mailing list. The more I isolated myself from the community, the less progress I made. The lack of communication on my side also prevented me from learning from other people and getting guidance from my mentor.</para> + </listitem> + <listitem> + <para>CloudStack is evolving at a fast pace. Many APIs are added and many patches are submitted every day. That's why the community uses the word "SNAPSHOT" for each version. At this moment I am learning to deal with fast-changing code and frequent upgrades. A large portion of my time is devoted to system installation and deployment. I am getting used to treating system exceptions and errors as the common case. That's another reason why communication with the community is critical. </para> + + </listitem> + + <listitem> + <para>In addition to the project itself, I am strengthening my technical skills at the same time. 
</para> + +<itemizedlist> +<listitem><para>I learned to use some useful software tools: Maven, Git, Publican, etc.</para></listitem> + <listitem> + <para> +Reading the Whirr source code taught me more advanced Java programming techniques, e.g. generics, wildcards, the service loader, the Executor model, Future objects, etc.</para> + </listitem> + <listitem> + <para>I was exposed to Jclouds, a useful cloud-neutral library for manipulating different cloud infrastructures.</para> + </listitem> + <listitem><para>I gained a deeper understanding of cloud web services and learned to use several cloud clients, e.g. the Jclouds CLI, CloudMonkey, etc.</para></listitem> + </itemizedlist> + + + </listitem> + + + </itemizedlist> + + <para>I am grateful that Google Summer of Code exists. It gives us students a sense of how fast real-world software development moves and provides us with hands-on experience of coding in large open source projects. More importantly, it's a self-challenging process that strengthens our minds along the way.</para> + </section> </section> http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/clusterDefinition.png ---------------------------------------------------------------------- diff --git a/docs/en-US/images/clusterDefinition.png b/docs/en-US/images/clusterDefinition.png new file mode 100644 index 0000000..6170f9f Binary files /dev/null and b/docs/en-US/images/clusterDefinition.png differ http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/launchHadoopClusterApi.png ---------------------------------------------------------------------- diff --git a/docs/en-US/images/launchHadoopClusterApi.png b/docs/en-US/images/launchHadoopClusterApi.png new file mode 100644 index 0000000..6f94c74 Binary files /dev/null and b/docs/en-US/images/launchHadoopClusterApi.png differ http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/launchHadoopClusterCmd.png 
---------------------------------------------------------------------- diff --git a/docs/en-US/images/launchHadoopClusterCmd.png b/docs/en-US/images/launchHadoopClusterCmd.png new file mode 100644 index 0000000..66a0c75 Binary files /dev/null and b/docs/en-US/images/launchHadoopClusterCmd.png differ http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/whirrDependency.png ---------------------------------------------------------------------- diff --git a/docs/en-US/images/whirrDependency.png b/docs/en-US/images/whirrDependency.png new file mode 100644 index 0000000..acdec78 Binary files /dev/null and b/docs/en-US/images/whirrDependency.png differ http://git-wip-us.apache.org/repos/asf/cloudstack/blob/2238e2bb/docs/en-US/images/whirrOutput.png ---------------------------------------------------------------------- diff --git a/docs/en-US/images/whirrOutput.png b/docs/en-US/images/whirrOutput.png new file mode 100644 index 0000000..7c3b512 Binary files /dev/null and b/docs/en-US/images/whirrOutput.png differ