Hi Marlon,

Thanks for confirming and sharing the legal link.

-Mangirish

On Thu, Oct 13, 2016 at 12:13 PM, Pierce, Marlon <[email protected]> wrote:

> BSD is ok: https://www.apache.org/legal/resolved.
>
> *From: *Mangirish Wagle <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Thursday, October 13, 2016 at 12:03 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Re: Running MPI jobs on Mesos based clusters
>
> Hello Devs,
>
> I needed some advice on the license of the MPI libraries. The MPICH
> library that I have been trying claims to have a "BSD Like" license
> (http://git.mpich.org/mpich.git/blob/HEAD:/COPYRIGHT).
>
> I am aware that OpenMPI, which uses a BSD license, is currently used in our
> application. I had chosen to start investigating MPICH because it claims to
> be a highly portable and high quality implementation of the latest MPI
> standard, suitable for cloud based clusters.
>
> If anyone could please advise on the acceptance of the MPICH library's
> "BSD Like" license for ASF, that would help.
>
> Thank you.
>
> Best Regards,
>
> Mangirish Wagle
>
> On Thu, Oct 6, 2016 at 1:48 AM, Mangirish Wagle <[email protected]> wrote:
>
> Hello Devs,
>
> The network issue mentioned above now stands resolved. The problem was
> that iptables had some conflicting rules which blocked the traffic. It
> was resolved by a simple iptables flush.
>
> Here is the test MPI program running on multiple machines:-
>
> [centos@mesos-slave-1 ~]$ mpiexec -f machinefile -n 2 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
> Hello world! I am process number: 1 on host mesos-slave-2
>
> The next step is to try invoking this through a framework like Marathon.
> However, the job submission still does not run through Marathon. It seems
> to get stuck in the 'waiting' state forever (for example,
> http://149.165.170.245:8080/ui/#/apps/%2Fmaw-try).
> Further, I notice that Marathon is listed under 'inactive frameworks' in
> the Mesos dashboard (http://149.165.171.33:5050/#/frameworks).
>
> I am trying to get this working, though any help or clues with this would
> be really helpful.
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Fri, Sep 30, 2016 at 9:21 PM, Mangirish Wagle <[email protected]> wrote:
>
> Hello Devs,
>
> I am currently running a sample MPI C program using 'mpiexec' provided by
> MPICH. I followed their installation guide
> <http://www.mpich.org/static/downloads/3.2/mpich-3.2-installguide.pdf> to
> install the libraries on the master and slave nodes of the Mesos cluster.
>
> The approach that I am trying out here is to equip the underlying nodes
> with MPI handling tools and then use a Mesos framework like Marathon/Aurora
> to submit jobs that run MPI programs by invoking these tools.
>
> You can potentially run an MPI program using mpiexec in the following
> manner:-
>
> # *mpiexec -f machinefile -n 2 ./mpitest*
>
> - *machinefile* -> File which contains an inventory of machines to run
>   the program on and the number of processes on each machine.
> - *mpitest* -> MPI program compiled in C using the mpicc compiler. The
>   program returns the process number and the hostname of the machine
>   running the process.
> - *-n* -> Option that indicates the number of processes to spawn.
>
> Example of machinefile contents:-
>
> # Entries in the format <hostname/IP>:<number of processes>
> mesos-slave-1:1
> mesos-slave-2:1
>
> The reason for choosing slaves is that Mesos runs the jobs on slaves,
> managed by 'agents' pertaining to the slaves.
>
> Output of the program with '-n 1':-
>
> # mpiexec -f machinefile -n 1 ./mpitest
> Hello world! I am process number: 0 on host mesos-slave-1
>
> But when I try '-n 2', I hit the following error:-
>
> # mpiexec -f machinefile -n 2 ./mpitest
> [proxy:0:1@mesos-slave-2] HYDU_sock_connect (/home/centos/mpich-3.2/src/pm/hydra/utils/sock/sock.c:172): unable to connect from "mesos-slave-2" to "mesos-slave-1" (No route to host)
> [proxy:0:1@mesos-slave-2] main (/home/centos/mpich-3.2/src/pm/hydra/pm/pmiserv/pmip.c:189): *unable to connect to server mesos-slave-1 at port 44788* (check for firewalls!)
>
> The program execution seems to fail because network traffic is being
> blocked. I checked the security groups in scigap openstack for the
> mesos-slave-1 and mesos-slave-2 nodes, and they are set to the 'wideopen'
> policy. Furthermore, I tried adding explicit rules to the policies to allow
> all TCP and UDP traffic (currently I am not sure what protocol is used
> underneath), but even then it continues throwing this error.
>
> Any clues, suggestions, comments about the error or approach as a whole
> would be helpful.
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Tue, Sep 27, 2016 at 11:23 AM, Mangirish Wagle <[email protected]> wrote:
>
> Hello Devs,
>
> Thanks Gourav and Shameera for all the work w.r.t. setting up the
> Mesos-Marathon cluster on Jetstream.
>
> I am currently evaluating MPICH (http://www.mpich.org/about/overview/) to
> be used for launching MPI jobs on top of Mesos. MPICH version 1.2 supports
> Mesos based MPI scheduling. I have also been trying to submit jobs to the
> cluster through Marathon. However, in either case I am currently facing
> issues which I am working to get resolved.
>
> I am compiling my notes into the following google doc. You may please
> review and let me know your comments and suggestions.
>
> https://docs.google.com/document/d/1p_Y4Zd4I4lgt264IHspXJli3la25y6bcPcmrTD6nR8g/edit?usp=sharing
>
> Thanks and Regards,
>
> Mangirish Wagle
>
> On Wed, Sep 21, 2016 at 3:20 PM, Shenoy, Gourav Ganesh <[email protected]> wrote:
>
> Hi Mangirish,
>
> I have set up a Mesos-Marathon cluster for you on Jetstream. I will share
> the cluster details with you in a separate email. Kindly note that
> there are 3 masters & 2 slaves in this cluster.
>
> I am also working on automating this process for Jetstream (similar to
> Shameera’s ansible script for EC2) and when that is ready, we can create
> clusters or add/remove slave machines from the cluster.
>
> Thanks and Regards,
>
> Gourav Shenoy
>
> *From: *Mangirish Wagle <[email protected]>
> *Reply-To: *"[email protected]" <[email protected]>
> *Date: *Wednesday, September 21, 2016 at 2:36 PM
> *To: *"[email protected]" <[email protected]>
> *Subject: *Running MPI jobs on Mesos based clusters
>
> Hello All,
>
> I would like to post for everybody's awareness about the study that I am
> undertaking this fall, i.e. to evaluate various frameworks that
> would facilitate MPI jobs on Mesos based clusters for Apache Airavata.
>
> Some of the options that I am looking at are:-
>
> 1. MPI support framework bundled with Mesos
> 2. Apache Aurora
> 3. Marathon
> 4. Chronos
>
> Some of the evaluation criteria that I am planning to base my
> investigation on are:-
>
> - Ease of setup
> - Documentation
> - Reliability features like HA
> - Scaling and Fault recovery
> - Performance
> - Community Support
>
> Gourav and Shameera are working on ansible based automation to spin up a
> Mesos based cluster and I am planning to use it to set up a cluster for
> experimentation.
>
> Any suggestions or information about prior work on this would be highly
> appreciated.
>
> Thank you.
>
> Best Regards,
>
> Mangirish Wagle
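
For anyone reproducing the setup described in this thread, here is a minimal Python sketch of the machinefile format and the mpiexec invocation quoted above. Only the `<hostname/IP>:<number of processes>` entry format, the `#` comment line, and the `-f` and `-n` options come from the thread. The helper names (`render_machinefile`, `parse_machinefile`, `mpiexec_command`) are hypothetical and made up for illustration. The sketch only builds the file contents and the command line; it does not launch anything.

```python
import shlex


def render_machinefile(hosts):
    """Render {hostname: process count} in the format shown in the thread."""
    lines = ["# Entries in the format <hostname/IP>:<number of processes>"]
    for host, count in hosts.items():
        if count < 1:
            raise ValueError(f"process count for {host!r} must be >= 1")
        lines.append(f"{host}:{count}")
    return "\n".join(lines) + "\n"


def parse_machinefile(text):
    """Parse machinefile text back into {hostname: process count}.

    Assumption for this sketch: every entry carries an explicit ':<count>',
    as in the example from the thread. Blank lines and '#' comments are skipped.
    """
    hosts = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        host, sep, count = line.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected <hostname/IP>:<count>, got {line!r}")
        hosts[host] = int(count)
    return hosts


def mpiexec_command(machinefile_path, nprocs, program):
    """Build the argument list for the mpiexec call quoted in the thread."""
    return ["mpiexec", "-f", machinefile_path, "-n", str(nprocs), program]


if __name__ == "__main__":
    hosts = {"mesos-slave-1": 1, "mesos-slave-2": 1}
    print(render_machinefile(hosts), end="")
    print(shlex.join(mpiexec_command("machinefile", 2, "./mpitest")))
```

Returning the command as an argument list rather than a single string means it can be passed to `subprocess.run` without shell quoting concerns, or embedded in a job definition submitted through a framework such as Marathon.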
