Re: Running MPI jobs on Mesos based clusters

Mangirish Wagle Tue, 18 Oct 2016 08:51:12 -0700

Sure Suresh, will update my findings on the mailing list. Thanks!

On Tue, Oct 18, 2016 at 7:59 AM, Suresh Marru <sma...@apache.org> wrote:


> Hi Mangirish,
>
> This is interesting. Looking forward to seeing what you find out further
> on gang scheduling support. Since the compute nodes are getting bigger,
> even exploring single-node MPI (on Jetstream, using 22 cores) will
> help.
>
> Suresh
>
> P.S. Good to see the momentum on mailing list discussions on such topics.
>
> On Oct 18, 2016, at 1:54 AM, Mangirish Wagle <vaglomangir...@gmail.com>
> wrote:
>
> Hello Devs,
>
> Here is an update on some new learnings and thoughts based on my
> interactions with Mesos and Aurora devs.
>
> MPI implementations in Mesos repositories (like MPI Hydra) rely on
> obsolete MPI platforms and are no longer supported by the developer
> community. Hence it is not recommended that we use them for our purpose.
>
> One of the known ways of running MPI jobs over Mesos is "gang
> scheduling", which basically distributes the MPI run over multiple jobs
> on Mesos in place of multiple nodes. The challenge here is that the jobs
> need to be scheduled as one task, and an error in any one job should
> collectively error out the main program, including all the distributed
> jobs.
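
A rough sketch of the "one job errors, everything errors" half of gang scheduling, using plain local subprocesses rather than the Mesos API. The `run_gang` helper and its arguments are hypothetical names for illustration only; it does not address the other half of the problem, which is getting all members scheduled together in the first place.

```python
import subprocess
import time


def run_gang(commands, poll_interval=0.05):
    """Start all commands together; if any one fails, terminate the rest.

    Returns True only if every member of the gang exits with status 0.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    try:
        while True:
            codes = [p.poll() for p in procs]
            if any(c not in (None, 0) for c in codes):
                return False  # one member failed, so the whole gang fails
            if all(c == 0 for c in codes):
                return True
            time.sleep(poll_interval)
    finally:
        # Stop any member that is still running (nothing to do on success).
        for p in procs:
            if p.poll() is None:
                p.terminate()
        for p in procs:
            p.wait()
```

Each entry in `commands` is an argument list as accepted by `subprocess.Popen`. On a real cluster the members would be Mesos tasks, and the terminate step would be a kill request to the scheduler instead of a local signal.
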
>
> One of the Mesos developers (Niklas Nielsen) pointed me to his work on
> gang scheduling: https://github.com/nqn. This code may not be fully
> tested, but it is certainly a good starting point to explore gang
> scheduling.
>
> One of the Aurora developers (Stephen Erb) suggests using gang scheduling
> on top of Aurora. The Aurora scheduler assumes that every job is
> independent. Hence, we would need to develop some external scaffolding to
> coordinate and schedule these jobs, which might not be trivial. One
> advantage of using Aurora as a backend for gang scheduling is that we would
> inherit the robustness of Aurora, which would otherwise be a key challenge
> if targeting bare Mesos.
>
> As an alternative to all the options above, I think we should be able to
> run a 1 node MPI job through Aurora. A resource offer with CPUs and
> memory from Mesos is abstracted as a single runtime, but is mapped to
> multiple nodes underneath, which eventually would exploit distributed
> resource capabilities.
>
> I intend to try out the 1 node MPI job submission approach first and
> simultaneously explore the gang scheduling approach.
>
> Please let me know your thoughts/suggestions.
>
> Best Regards,
> Mangirish
>
>
>
> On Thu, Oct 13, 2016 at 12:39 PM, Mangirish Wagle <
> vaglomangir...@gmail.com> wrote:
>
>> Hi Marlon,
>> Thanks for confirming and sharing the legal link.
>>
>> -Mangirish
>>
>> On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <marpi...@iu.edu> wrote:
>>
>>> BSD is ok: https://www.apache.org/legal/resolved.
>>>
>>>
>>>
>>> *From: *Mangirish Wagle <vaglomangir...@gmail.com>
>>> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>>> *Date: *Thursday, October 13, 2016 at 12:03 PM
>>> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>>> *Subject: *Re: Running MPI jobs on Mesos based clusters
>>>
>>>
>>>
>>> Hello Devs,
>>>
>>> I need some advice on the license of the MPI libraries. The MPICH
>>> library that I have been trying claims to have a "BSD Like" license (
>>> http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>>>
>>> I am aware that OpenMPI, which uses a BSD license, is currently used in
>>> our application. I chose to start investigating MPICH because it claims
>>> to be a highly portable, high-quality implementation of the latest MPI
>>> standard, suitable for cloud based clusters.
>>>
>>> If anyone could please advise on whether the MPICH library's "BSD Like"
>>> license is acceptable for ASF, that would help.
>>>
>>> Thank you.
>>>
>>> Best Regards,
>>>
>>> Mangirish Wagle
>>>
>>>
>>>
>>> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <
>>> vaglomangir...@gmail.com> wrote:
>>>
>>> Hello Devs,
>>>
>>>
>>>
>>> The network issue mentioned above is now resolved. The problem was
>>> that iptables had some conflicting rules which blocked the traffic. It
>>> was resolved by a simple iptables flush.
>>>
>>>
>>>
>>> Here is the test MPI program running on multiple machines:-
>>>
>>>
>>>
>>> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
>>>
>>> Hello world!  I am process number: 0 on host mesos-slave-1
>>>
>>> Hello world!  I am process number: 1 on host mesos-slave-2
>>>
>>>
>>>
>>> The next step is to try invoking this through a framework like Marathon.
>>> However, the job submission still does not run through Marathon. It seems
>>> to get stuck in the 'waiting' state forever (for example,
>>> http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try). Further, I notice
>>> that Marathon is listed under 'inactive frameworks' in the Mesos
>>> dashboard (http://149.165.171.33:5050/#/frameworks).
>>>
>>>
>>>
>>> I am trying to get this working, though any help or clues with this
>>> would be much appreciated.
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Mangirish Wagle
>>>
>>>
>>>
>>>
>>> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <
>>> vaglomangir...@gmail.com> wrote:
>>>
>>> Hello Devs,
>>>
>>>
>>>
>>> I am currently running a sample MPI C program using 'mpiexec' provided
>>> by MPICH. I followed their installation guide
>>> <http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf> to
>>> install the libraries on the master and slave nodes of the mesos cluster.
>>>
>>>
>>>
>>> The approach that I am trying out here is to equip the underlying nodes
>>> with MPI handling tools and then use a Mesos framework like
>>> Marathon/Aurora to submit jobs that run MPI programs by invoking these
>>> tools.
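
As a rough illustration of this approach, the sketch below builds a Marathon app definition whose command shells out to mpiexec, and prepares (without sending) the POST to Marathon's `/v2/apps` endpoint. The helper names, the resource numbers, and the `marathon.example` host in the usage note are placeholders, and the sketch assumes `machinefile` and `./mpitest` resolve on whichever agent runs the task.

```python
import json
import urllib.request


def mpi_app_definition(app_id, nprocs, machinefile="machinefile", binary="./mpitest"):
    """Build a Marathon app definition that shells out to mpiexec."""
    return {
        "id": app_id,
        "cmd": f"mpiexec -f {machinefile} -n {nprocs} {binary}",
        "cpus": 1,       # placeholder resource request
        "mem": 128,      # MiB, placeholder
        "instances": 1,  # a single launcher task that starts mpiexec
    }


def build_submit_request(marathon_url, app):
    """Prepare (but do not send) the POST to Marathon's /v2/apps endpoint."""
    return urllib.request.Request(
        marathon_url.rstrip("/") + "/v2/apps",
        data=json.dumps(app).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the result of `build_submit_request("http://marathon.example:8080", mpi_app_definition("/maw-try", 2))` to `urllib.request.urlopen` would submit the job.
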
>>>
>>>
>>>
>>> You can potentially run an MPI program using mpiexec in the following
>>> manner:-
>>>
>>>
>>>
>>> # *mpiexec -f machinefile -n 2 ./mpitest*
>>>
>>>    - *machinefile *-> File which contains an inventory of machines to
>>>    run the program on and the number of processes on each machine.
>>>    - *mpitest *-> MPI program compiled in C using the mpicc compiler. The
>>>    program prints the process number and the hostname of the machine
>>>    running the process.
>>>    - *-n *-> Option which indicates the number of processes to spawn.
>>>
>>> Example of machinefile contents:-
>>>
>>>
>>>
>>> # Entries in the format <hostname/IP>:<number of processes>
>>>
>>> mesos-slave-1:1
>>>
>>> mesos-slave-2:1
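
For reference, a small sketch that reads a machinefile in the format shown above and totals the process slots it lists, which is handy for sanity-checking the value passed to `-n`. The function names are hypothetical, and treating a host with no `:<number>` suffix as one process is an assumption of this sketch. Splitting on the first colon also means bare IPv6 addresses are not handled.

```python
def parse_machinefile(text):
    """Parse '<host>:<nprocs>' lines into (host, nprocs) pairs.

    Blank lines and lines starting with '#' are skipped. A host with no
    ':<nprocs>' suffix is assumed here to mean one process.
    """
    entries = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, count = line.partition(":")
        entries.append((host, int(count) if sep else 1))
    return entries


def total_slots(entries):
    """Total number of processes the machinefile provides for."""
    return sum(n for _, n in entries)
```
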
>>>
>>>
>>>
>>> The reason for choosing slaves is that Mesos runs the jobs on slaves,
>>> managed by 'agents' pertaining to the slaves.
>>>
>>>
>>>
>>> Output of the program with '-n 1':-
>>>
>>>
>>>
>>> # mpiexec -f machinefile -n 1 ./mpitest
>>>
>>> Hello world!  I am process number: 0 on host mesos-slave-1
>>>
>>>
>>>
>>> But when I try for '-n 2', I am hitting the following error:-
>>>
>>>
>>>
>>> # mpiexec -f machinefile -n 2 ./mpitest
>>>
>>> [proxy:0:1@mesos-slave-2] HYDU_sock_connect
>>> (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to
>>> connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
>>>
>>> [proxy:0:1@mesos-slave-2] main 
>>> (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189):
>>> *unable to connect to server mesos-slave-1 at port 44788* (check for
>>> firewalls!)
>>>
>>>
>>>
>>> The program execution seems to fail because network traffic is being
>>> blocked. I checked the security groups in scigap openstack for the
>>> mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen'
>>> policy. Furthermore, I tried adding explicit rules to the policies to
>>> allow all TCP and UDP (currently I am not sure which protocol is used
>>> underneath), but even then it continues to throw this error.
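
One way to narrow this down independently of MPI is a plain TCP reachability check between the two nodes. The sketch below assumes the traffic in question is TCP, and `can_connect` is a hypothetical helper name; since the port in the error above (44788) appears to be picked per run, the check only says whether one given port is reachable at the moment it is tried.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return (True, None) if a TCP connection succeeds, else (False, reason)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, None
    except OSError as exc:
        # Covers refused connections, timeouts, and "No route to host".
        return False, str(exc)
```

With a test listener started on mesos-slave-1, calling `can_connect("mesos-slave-1", <port>)` from mesos-slave-2 shows whether something between the nodes (such as a host firewall) is dropping the connection.
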
>>>
>>>
>>>
>>> Any clues, suggestions, comments about the error or approach as a whole
>>> would be helpful.
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Mangirish Wagle
>>>
>>>
>>>
>>> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <
>>> vaglomangir...@gmail.com> wrote:
>>>
>>> Hello Devs,
>>>
>>>
>>>
>>> Thanks Gourav and Shameera for all the work w.r.t. setting up the
>>> Mesos-Marathon cluster on Jetstream.
>>>
>>>
>>>
>>> I am currently evaluating MPICH (http://www.mpich.org/about/overview/)
>>> to be used for launching MPI jobs on top of Mesos. MPICH version 1.2
>>> supports Mesos based MPI scheduling. I have also been trying to submit
>>> jobs to the cluster through Marathon. However, in either case I am
>>> currently facing issues which I am working to resolve.
>>>
>>>
>>>
>>> I am compiling my notes into the following google doc. Please review
>>> it and let me know your comments and suggestions.
>>>
>>>
>>>
>>> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Mangirish Wagle
>>>
>>>
>>>
>>> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <
>>> goshe...@indiana.edu> wrote:
>>>
>>> Hi Mangirish,
>>>
>>>
>>>
>>> I have set up a Mesos-Marathon cluster for you on Jetstream. I will
>>> share the cluster details with you in a separate email. Kindly note
>>> that there are 3 masters & 2 slaves in this cluster.
>>>
>>>
>>>
>>> I am also working on automating this process for Jetstream (similar to
>>> Shameera’s ansible script for EC2) and when that is ready, we can create
>>> clusters or add/remove slave machines from the cluster.
>>>
>>>
>>>
>>> Thanks and Regards,
>>>
>>> Gourav Shenoy
>>>
>>>
>>>
>>> *From: *Mangirish Wagle <vaglomangir...@gmail.com>
>>> *Reply-To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>>> *Date: *Wednesday, September 21, 2016 at 2:36 PM
>>> *To: *"dev@airavata.apache.org" <dev@airavata.apache.org>
>>> *Subject: *Running MPI jobs on Mesos based clusters
>>>
>>>
>>>
>>> Hello All,
>>>
>>>
>>>
>>> I would like to make everybody aware of the study that I am undertaking
>>> this fall, i.e. evaluating various frameworks that would facilitate MPI
>>> jobs on Mesos based clusters for Apache Airavata.
>>>
>>>
>>>
>>> Some of the options that I am looking at are:-
>>>
>>>    1. MPI support framework bundled with Mesos
>>>    2. Apache Aurora
>>>    3. Marathon
>>>    4. Chronos
>>>
>>> Some of the evaluation criteria on which I am planning to base my
>>> investigation are:-
>>>
>>>    - Ease of setup
>>>    - Documentation
>>>    - Reliability features like HA
>>>    - Scaling and Fault recovery
>>>    - Performance
>>>    - Community Support
>>>
>>> Gourav and Shameera are working on ansible based automation to spin up a
>>> Mesos based cluster, and I am planning to use it to set up a cluster for
>>> experimentation.
>>>
>>>
>>>
>>> Any suggestions or information about prior work on this would be highly
>>> appreciated.
>>>
>>>
>>>
>>> Thank you.
>>>
>>>
>>>
>>> Best Regards,
>>>
>>> Mangirish Wagle
>>>
