Hi Renan,

Since you did a similar exercise using Go [1], it would be good to get your 
feedback and guidance on the discussions Gourav is summarizing below. 


[1] - http://markmail.org/thread/ymj7yqvvbhrjwv3s 

> On Oct 17, 2016, at 11:32 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> 
> wrote:
> Hi dev,
> Now that I have been able to get jobs scheduled via Aurora, I thought I 
> should summarize my understanding. I would also like to briefly draw out the 
> plan which I am working on with respect to using Mesos with Airavata.
> Apache Aurora:
> ·         Aurora, similar to Marathon & Chronos, is a service scheduler 
> framework for Mesos. It has been built for scheduling long running services & 
> cron jobs on Mesos.
> ·         The advantage with Aurora (over Marathon & Chronos) is that it 
> works well for one-off jobs as well – i.e. If I want to run a job and get the 
> output, Aurora is a better fit than Marathon & Chronos, since Marathon will 
> never let the job exit (and keep restarting it on slaves) & Chronos is ONLY 
> for crons.
> ·         Aurora also allows fine grained control of the jobs that need to be 
> submitted – the concept of jobs, tasks, processes – a job can consist of one 
> or more tasks, and a task can consist of one or more processes.
> ·         Aurora manages jobs that are made up of tasks; Mesos manages the 
> tasks that consist of processes; Thermos (the Aurora executor) manages the 
> processes.
> ·         We can control resource utilization at task level because of the 
> above job abstractions that Aurora provides.
> ·         Among many other features, a useful one is the resource-quota 
> management for users & the ability to support multiple users to run jobs.
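The job/task/process hierarchy described above can be sketched as plain Python data classes. This is only an illustrative model under stated assumptions, not Aurora's actual configuration DSL; the class names, fields, and the sample job are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Process:
    """Smallest unit: a single command line (run by the executor, Thermos)."""
    name: str
    cmdline: str


@dataclass
class Task:
    """A group of processes that share one resource allocation."""
    name: str
    processes: List[Process]
    cpu: float    # cores reserved for the whole task (hypothetical field)
    ram_mb: int   # memory reserved for the whole task (hypothetical field)


@dataclass
class Job:
    """What a user submits: one or more tasks."""
    name: str
    owner: str
    tasks: List[Task] = field(default_factory=list)

    def total_cpu(self) -> float:
        return sum(t.cpu for t in self.tasks)

    def total_ram_mb(self) -> int:
        return sum(t.ram_mb for t in self.tasks)


# Hypothetical one-off job with a single task made of two processes.
job = Job(
    name="demo_run",
    owner="airavata",
    tasks=[
        Task(
            name="compute",
            processes=[
                Process("stage_in", "cp /data/input.dat ."),
                Process("run", "./app input.dat > output.dat"),
            ],
            cpu=1.0,
            ram_mb=512,
        )
    ],
)
print(job.total_cpu(), job.total_ram_mb())
```

Because resources are attached to the task rather than to individual processes, limits are naturally expressed at task level, which matches the point about task-level resource control.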
> Current focus:
> ·         I am currently working on building a Thrift based client for 
> Aurora, and have been successful in implementing one, but with limited 
> operations.
> ·         I will be adding support for more operations keeping them aligned 
> to Airavata job submission/monitoring requirements.
> ·         I am currently focusing on targeting Airavata deployment to Mesos 
> on a single cluster (e.g. AWS). The flow would look as follows:
> <image001.png>
> ·         As you can see, currently there is just a single Mesos cluster. The 
> future focus would be to expand this to have multiple clusters.
> Subsequent work:
> ·         Once we are able to test Airavata deployment to single cluster 
> successfully, we can expand this to a multi-cluster environment.
> ·         Here we would have multiple Mesos clusters which would somehow 
> need to be managed. But the overall flow would look as follows:
> <image002.png>
> ·         We can either have multiple Mesos masters (for each individual 
> cluster), that are connected to each other via VPN, or have a single master – 
> in which case we would need to consider all other nodes as slaves.
> ·         This is a design issue which needs discussion, and Suresh has some 
> ideas on how to do this.
> Thanks and Regards,
> Gourav Shenoy
> From: Suresh Marru <sma...@apache.org>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Friday, October 7, 2016 at 11:43 PM
> To: Airavata Dev <dev@airavata.apache.org>
> Subject: Re: Mesos based meta-scheduling for Airavata
> Hi Gourav, 
> Thank you for the nice informative summaries, posts like these are always 
> educational. Keep’em coming. 
> Suresh
> On Oct 7, 2016, at 10:56 PM, Shenoy, Gourav Ganesh <goshe...@indiana.edu> 
> wrote:
> Hi dev,
> I have been exploring different frameworks for Mesos which would help our 
> use-case of providing Airavata the capability to run jobs in a Mesos based 
> ecosystem. In particular, I have been playing around with Marathon & Chronos 
> and I am now going to be working on Apache Aurora. 
> I have summarized my understanding about Mesos, Marathon & Chronos below. I 
> will send out a separate email about Aurora later.
> Apache Mesos:
> ·         Apache Mesos is an open-source cluster manager, in the sense that 
> it helps deploy & manage different frameworks (or applications) in a large 
> clustered environment easily.
> ·         Mesos provides the ability to utilize the underlying shared pool of 
> nodes as a single compute unit – that is, it can run many applications on 
> these nodes efficiently.
> ·         Mesos uses the concept of “offers” for scheduling and running jobs 
> on the underlying nodes. When a framework (application) wants to run 
> computations/jobs on the cluster, Mesos will decide how many resources it 
> will “offer” that framework based on the availability. The framework will 
> then decide which resources to use from the offer, and subsequently run the 
> computation/job on that resource.
> ·         In a typical cluster, you will have 3 or more Mesos masters & 
> multiple Mesos slaves. Multiple mesos masters help in providing high 
> availability – if one master goes down, Mesos will reelect a new leader 
> (master) – using Zookeeper.
> ·         The task mentioned above of providing “offers” to frameworks is 
> done by a master, whereas the slaves are the ones who run these computations.
> ·         Some additional points:
> o    I built a Mesos cluster with 3 masters & 2 slaves on EC2.
> o    Each master & slave has 1GB of RAM & 1vCPU with 20GB of disk space.
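The offer cycle described above can be illustrated with a small simulation. This is a minimal sketch that does not use the Mesos API; the class names, the node names, and the "accept the first offer that fits" policy are assumptions made only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Offer:
    agent: str
    cpus: float
    mem_mb: int


@dataclass
class JobRequest:
    name: str
    cpus: float
    mem_mb: int


def make_offers(free: Dict[str, Dict[str, float]]) -> List[Offer]:
    """Master side: advertise what is currently free on each node."""
    return [Offer(agent, r["cpus"], int(r["mem_mb"])) for agent, r in free.items()]


def choose_offer(offers: List[Offer], job: JobRequest) -> Optional[Offer]:
    """Framework side: accept the first offer that is large enough, else none."""
    for offer in offers:
        if offer.cpus >= job.cpus and offer.mem_mb >= job.mem_mb:
            return offer
    return None


# Hypothetical free resources on two slave nodes.
free_resources = {
    "slave-1": {"cpus": 0.5, "mem_mb": 256},
    "slave-2": {"cpus": 1.0, "mem_mb": 1024},
}
job = JobRequest("echo-job", cpus=1.0, mem_mb=512)
accepted = choose_offer(make_offers(free_resources), job)
print(accepted.agent if accepted else "no suitable offer")
```

The split of responsibilities is the point here: the master only decides what to offer, while the framework decides which offer, if any, to use.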
> Marathon:
> ·         Marathon is considered a framework that runs on top of Mesos. It is 
> a container orchestration platform for Mesos and essentially acts as a 
> service scheduler.
> ·         It is named “marathon” because it is intended for long running 
> applications. That is, Marathon makes sure that the service it is running 
> never stops – if a service goes down or the slave on which the service is run 
> dies, marathon keeps re-starting it on different slaves. 
> ·         In some sense Marathon is very good for ensuring high availability 
> of services. That is, instead of running services directly on Mesos, run it 
> in Marathon if you never want it to die.
> Note: You can decide to run a service on multiple slave nodes and if 
> resources on these slaves are available, Mesos will “offer” them to Marathon.
> ·         It is called a container orchestration platform because it 
> “launches” these services inside a container – either Docker OR Mesos 
> container.
> ·         In my opinion it is not a suitable “job scheduler” for Airavata 
> because in Airavata we need to run a job and get the output rather than 
> keeping it running always. Instead, we can run other schedulers – 
> chronos/aurora as a service in Marathon.
> Chronos:
> ·         Chronos is a Cron scheduler for Mesos. It is good for running 
> scheduled jobs – jobs that need to be run for a certain number of times, 
> repeatedly after certain intervals.
> ·         Chronos also provides the ability to add dependencies between jobs 
> – that is, if job1 is dependent on another job2, then it will run job2 first 
> and then run job1 after job2 completes. It also builds a Directed Acyclic 
> Graph (DAG) based on these dependencies.
> ·         Similar to Marathon, Chronos receives “offers” from the Mesos master 
> whenever it needs to run a job on Mesos.
> ·         Again, I found that Chronos does not fit the Airavata use-case 
> since I could not find a way to run one-off jobs via Chronos – you need to 
> specify interval time for Chronos, & Chronos then re-runs the job after that 
> interval is complete (even if you decide to specify num. of repetitions=1).
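Dependency-based ordering over a DAG, of the kind mentioned for Chronos above, can be sketched with a topological sort. This is not Chronos code; it uses Python's standard graphlib module (Python 3.9+), and the job names are hypothetical. A job runs only after every job it depends on has completed.

```python
from graphlib import TopologicalSorter

# Hypothetical jobs; each key maps to the list of jobs it depends on.
depends_on = {
    "stage_input": [],
    "run_simulation": ["stage_input"],
    "post_process": ["run_simulation"],
    "archive_output": ["run_simulation"],
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(depends_on).static_order())
print(order)
```

TopologicalSorter raises graphlib.CycleError if the dependencies contain a cycle, which is why such a graph has to be acyclic.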
> Some additional points:
> ·         Marathon & Chronos both have REST API support – eg: you can submit 
> jobs via APIs along with other interactions such as list jobs, etc.
> ·         I installed Marathon & Chronos frameworks on the Mesos master 
> nodes. This is what their health looks like on the Mesos dashboard:
> <image002.png>
>                 As you can see, there are 3 active tasks running in Chronos & 
> 4 active tasks (long running) in Marathon.
> ·         I also installed Chronos as a service inside Marathon, and this is 
> what it looks like in the Marathon UI:
> <image004.png>
> Interestingly, Chronos (as a service in Marathon) was smart enough to 
> identify the jobs submitted via Chronos (as a framework on Mesos) & 
> vice-versa.
> ·         Also, Mesos dashboard lists the active tasks it is running & 
> details about which slave the task is running on. It also lists Completed 
> tasks. The “Sandbox” gives you access to the stdout/stderr files for the 
> tasks as well as any other directories that were created as part of the task.
> <image005.png>
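As an illustration of the REST submission mentioned above, the following sketch builds (but does not send) a request to Marathon's /v2/apps endpoint using only the Python standard library. The host name, the app id, and the command are hypothetical, and the resource values are arbitrary.

```python
import json
from urllib.request import Request

# Hypothetical master host; 8080 is Marathon's default HTTP port.
MARATHON_URL = "http://mesos-master-1:8080"

# Minimal app definition: an id, a command, and the resources to reserve.
app = {
    "id": "/airavata-demo",
    "cmd": "python3 -m http.server 8000",
    "cpus": 0.25,
    "mem": 128,
    "instances": 1,
}

request = Request(
    url=f"{MARATHON_URL}/v2/apps",
    data=json.dumps(app).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
print(request.get_method(), request.full_url)
# Sending it would be urllib.request.urlopen(request), which needs a
# reachable Marathon instance and is therefore left out of this sketch.
```

Since Marathon keeps restarting whatever it launches, a definition like this suits a long-running service rather than a run-once job.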
> Pardon me for this long email. Next, I will explore Apache Aurora, which seems 
> a better fit for the Airavata use-case because it provides the features that 
> Chronos supports and can also run one-off jobs.
> Thanks and Regards,
> Gourav Shenoy
> From: "Shenoy, Gourav Ganesh" <goshe...@indiana.edu>
> Reply-To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Date: Friday, September 23, 2016 at 4:43 PM
> To: "dev@airavata.apache.org" <dev@airavata.apache.org>
> Subject: Mesos based meta-scheduling for Airavata
> Hi Dev,
> I am working on this project of building a Mesos based meta-scheduler for 
> Airavata, along with Shameera & Mangirish. Here is the JIRA 
> link: https://issues.apache.org/jira/browse/AIRAVATA-2082
> ·         We have identified some tasks that would be needed for achieving 
> this, and at the higher level it would consist of:
> 1.      Resource provisioning – We need to provision resources on cloud & hpc 
> infrastructures such as EC2, Jetstream, Comet, etc.
> 2.      Building a cluster – Deploying a Mesos cluster on set of nodes 
> obtained from (1) above for task management.
> 3.      Selecting a scheduler – We need to investigate the scheduler to use 
> with Mesos cluster. Some of the options are Marathon, Aurora. But we need to 
> find one that suits our needs of running serial as well as parallel (MPI) 
> jobs.
> 4.      Installing & running applications on this cluster – Once the cluster 
> has been deployed and a scheduler choice made, we need to be able to install 
> and run applications on this cluster using Airavata.
> ·         Until now we were able to look into the following:
> o   Resource provisioning:
> §  We explored several options of provisioning resources – using cloud 
> libraries as well as via ansible scripts.
> §  We built an OpenStack4J Java module which would provision instances on 
> OpenStack based clouds (e.g. Jetstream).
> §  We also built a CloudBridge Python module for provisioning EC2 instances 
> on Amazon. CloudBridge can also be used to provision instances on OpenStack.
> §  We wrote Ansible scripts for bringing up instances on both AWS and 
> OpenStack based clouds.
> §  Key Points: CloudBridge and OpenStack4J are powerful libraries for resource 
> provisioning, but currently they do single-instance provisioning and do not 
> support templated boot options such as CloudFormation (for AWS) & Heat (for 
> OpenStack).
> o   Building a cluster:
> §  We wrote an Ansible script for deploying a Mesos-Marathon cluster on a set 
> of nodes. This script will install necessary dependencies such as Zookeeper.
> §  We tested this on OpenStack based clouds & on EC2.
> §  OpenStack Magnum provides excellent support for doing resource 
> provisioning & deploying mesos cluster, but we are running into some problems 
> while trying it.
> o   Installing a scheduler:
> §  Our Ansible script is currently installing Marathon as the scheduler on 
> Mesos. We haven’t yet submitted jobs using Marathon.
> ·         Although not finalized, we are inclined towards using the Ansible 
> approach for the above, as Ansible also provides Python APIs, which will 
> allow us to integrate it with Airavata via Thrift. Hence we will be able to 
> easily invoke the Ansible scripts from code without needing to use the 
> command-line interface.
> ·         We are also progressively working on some work-items such as:
> o   Exploring options to provision and deploy a Mesos-Marathon cluster on HPC 
> systems such as Comet. The challenge would be to use Ansible to provision 
> resources and deploy the cluster. Once we have a cluster, we can try running 
> applications.
> o   Exploring different scheduler options for running serial and parallel 
> (MPI) jobs on such heterogeneous clusters.
> o   Exploring orchestration options such as OpenStack Heat, AWS 
> CloudFormation, OpenStack Magnum, etc.
> Any suggestions and comments are highly appreciated.
> Thanks and Regards,
> Gourav Shenoy
