Re: mesos and moving jobs between clusters

Shenoy, Gourav Ganesh Wed, 26 Oct 2016 09:24:33 -0700

Hi Mark,

Sorry for responding late, but Pankaj & Mangirish have already summarized some 
very good options for the problem you mentioned. I am not sure if you already 
have, but I would recommend taking a look at Aurora – a job scheduler framework 
for Mesos.


In the scenario you mentioned, some compute resources have GPUs and some do 
not, and jobs need to be delegated according to their resource requirements. 
Aurora handles this by matching each job's needs against the resources (cpu, 
gpu, ram, etc.) available on the target slaves, and runs the job on a slave 
that satisfies them. It also provides the ability to set resource quotas for 
specific users that submit jobs. Overall, it provides a rich set of features.
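
To make the matching idea concrete, here is a rough sketch in plain Python. It 
is not Aurora's API or configuration language; the Resources class, the fits() 
and place() helpers, and the slave names and sizes are all made up for 
illustration.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Resources:
    """Hypothetical resource bundle (not an Aurora or Mesos type)."""
    cpus: float
    gpus: int
    ram_mb: int


def fits(need: Resources, free: Resources) -> bool:
    """True if the free resources cover every dimension of the request."""
    return (free.cpus >= need.cpus
            and free.gpus >= need.gpus
            and free.ram_mb >= need.ram_mb)


def place(need: Resources, slaves: Dict[str, Resources]) -> Optional[str]:
    """First-fit placement: pick the first slave that can hold the job,
    deduct the job's resources from it, and return its name (or None)."""
    for name, free in slaves.items():
        if fits(need, free):
            free.cpus -= need.cpus
            free.gpus -= need.gpus
            free.ram_mb -= need.ram_mb
            return name
    return None


# Assumed example cluster: one slave without GPUs, one with GPUs.
demo_slaves = {
    "cpu-slave": Resources(cpus=16, gpus=0, ram_mb=64000),
    "gpu-slave": Resources(cpus=24, gpus=4, ram_mb=128000),
}
print(place(Resources(cpus=8, gpus=1, ram_mb=16000), demo_slaves))  # gpu-slave
```

A job that asks for a GPU only ever lands on a slave that has one free, so the 
submitter does not need to name the machine.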

Thanks and Regards,
Gourav Shenoy

From: "Miller, Mark" <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, October 26, 2016 at 11:11 AM
To: "[email protected]" <[email protected]>
Subject: RE: mesos and moving jobs between clusters

Hi Folks,

Thanks for your kind answers.
We are very specifically interested in how Mesos might allow us to submit to 
multiple machines without changing the “rules” we impose that are machine 
dependent.
For example, Gordon has 16 cores per node and no gpu
Comet has 24 cores per node, and 4 gpu nodes.
So we adjust our running rules depending on the resource.

I can imagine one solution would be to find the least common denominator, and 
run all jobs on virtual clusters with 16 cores maximum, and no gpus.
Or we might only submit jobs that can use gpus to resources that have them.
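
The two options above can be sketched in a few lines of Python. The core and 
GPU numbers come from the Gordon and Comet description in this message; the 
dictionary layout and the function names common_profile() and 
eligible_clusters() are assumptions made for illustration only.

```python
# Machine descriptions taken from the message; the structure is hypothetical.
clusters = {
    "gordon": {"cores_per_node": 16, "gpu_nodes": 0},
    "comet": {"cores_per_node": 24, "gpu_nodes": 4},
}


def common_profile(clusters):
    """Option 1: the least common denominator across all machines."""
    return {
        "cores_per_node": min(c["cores_per_node"] for c in clusters.values()),
        # GPUs are only part of the common profile if every machine has them.
        "gpu": all(c["gpu_nodes"] > 0 for c in clusters.values()),
    }


def eligible_clusters(clusters, needs_gpu):
    """Option 2: send GPU jobs only to machines that have GPU nodes."""
    return sorted(
        name for name, c in clusters.items()
        if not needs_gpu or c["gpu_nodes"] > 0
    )


print(common_profile(clusters))           # {'cores_per_node': 16, 'gpu': False}
print(eligible_clusters(clusters, True))  # ['comet']
```

Option 1 keeps a single rule set but leaves 8 cores per Comet node and all 
GPUs unused; option 2 keeps the hardware busy at the cost of per-job routing.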

Anyway, since we will all be in SD soon, it seems like we should chat a bit 
about this in person.
Maybe we can find a coffee break time, or set up a meeting during the meeting?

Mark




From: Pankaj Saha [mailto:[email protected]]
Sent: Tuesday, October 25, 2016 3:10 PM
To: dev <[email protected]>
Subject: Re: mesos and moving jobs between clusters


Hi Mark,


Mesos collects the resource information from all the nodes in the cluster 
(cores, memory, disk, and gpu) and presents a unified view, as if it is a 
single operating system. Mesosphere, a commercial entity behind Mesos, has 
built an ecosystem around Mesos as the kernel, called the "Data Center Operating 
System (DCOS)". Frameworks interact with Mesos to reserve resources and then 
use these resources to run jobs on the cluster. So, for example, if multiple 
frameworks such as Marathon, Apache Aurora, and a custom-MPI-framework are 
using Mesos, then there is a negotiation between Mesos and each framework on 
how many resources each framework gets. Once the framework, say Aurora, gets 
resources, it can decide how to use those resources. Some of the strengths of 
Mesos include fault tolerance at scale and the ability to co-schedule 
applications/frameworks on the cluster such that cluster utilization is high.
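
As a toy illustration of that negotiation, the sketch below offers the free 
resources to each framework in turn and lets each one launch whichever of its 
pending tasks fit. This is a simplified simulation in plain Python, not the 
Mesos scheduler API or its allocator; run_offer_cycle() and the task sizes are 
assumptions made for illustration.

```python
def run_offer_cycle(free, frameworks):
    """Offer the free resources to each framework in turn.

    free:       dict of resource name -> amount, e.g. {"cpus": 24, "gpus": 4}
    frameworks: list of (framework_name, pending_tasks); each task is a dict
                of resource name -> amount needed.
    Returns (launched, remaining), where launched is a list of
    (framework_name, task) pairs and remaining is what is left unallocated.
    """
    launched = []
    for name, tasks in frameworks:
        offer = dict(free)  # what this framework is offered right now
        for task in tasks:
            # The framework decides per task whether the offer is enough.
            if all(offer.get(res, 0) >= amount for res, amount in task.items()):
                for res, amount in task.items():
                    offer[res] -= amount
                launched.append((name, task))
        free = offer  # whatever was not used goes to the next framework
    return launched, free


launched, remaining = run_offer_cycle(
    {"cpus": 24, "gpus": 4},
    [
        ("aurora", [{"cpus": 16}, {"cpus": 16}]),    # second task will not fit
        ("marathon", [{"cpus": 4, "gpus": 1}]),
    ],
)
print(launched)
print(remaining)
```

The point of the model is the split of responsibilities: Mesos decides what to 
offer to whom, and each framework decides what to run on what it is offered.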


Mesos off-the-shelf only works when the Master and agent nodes have a line of 
communication to each other. We have worked on modifying the Mesos installation 
so that it even works when agents are behind firewalls on campus clusters. We 
are also working on getting the same setup to work on Jetstream and Chameleon 
where allocations are a mix of public IPs and internally accessible nodes. This 
will allow us to use Mesos to meta-schedule across clusters. We are also 
developing our own framework, to be able to customize scheduling and resource 
negotiations for science gateways on Mesos clusters. Our plan is to work with 
Suresh and Marlon's team so that it works with Airavata.


I will be presenting at the Gateways workshop in November, and then I will also 
be at SC along with my adviser (Madhu Govindaraju), if you would like to 
discuss any of these projects.


We are working on packaging our work so that it can be shared with this 
community.



Thanks

Pankaj

On Tue, Oct 25, 2016 at 11:36 AM, Mangirish Wagle <[email protected]> wrote:
Hi Mark,

Thanks for your question. So if I understand you correctly, you need a kind of 
load balancing between identical clusters through a single Mesos master?

With the current setup, from what I understand, we have separate Mesos 
masters for every cluster on separate clouds. However, it's a good 
investigative topic whether we can have a single Mesos master targeting 
multiple identical clusters. We have some work ongoing to use a virtual 
cluster setup with compute resources across clouds to install Mesos, but I am 
not sure if that is what you are looking for.

Regards,
Mangirish





On Tue, Oct 25, 2016 at 11:05 AM, Miller, Mark <[email protected]> wrote:
Hi all,

I posed a question to Suresh (see below), and he asked me to put this question 
on the dev list.
So here it is. I will be grateful for any comments about the issues you all are 
facing, and what has come up in trying this, as it seems likely that this is a 
much simpler problem in concept than it is in practice, but its solution has 
many benefits.

Here is my question:
A group of us have been discussing how we might simplify submitting jobs to 
different compute resources in our current implementation of CIPRES, and how 
cloud computing might facilitate this. But none of us are cloud experts. As I 
understand it, the mesos cluster that I have been seeing in the Airavata email 
threads is intended to make it possible to deploy jobs to multiple virtual 
clusters. I am (we are) wondering if Mesos manages submissions to identical 
virtual clusters on multiple machines, and if that works efficiently.

In our implementation, we have to change the rules to run efficiently on 
different machines, according to gpu availability and cores per node. I am 
wondering how Mesos/virtual clusters affect those considerations.
Can Mesos create basically identical virtual clusters independent of machine?

Thanks for any advice.

Mark
