I used MPI briefly a couple of years ago, and from what I remember: MPI tends to require so-called gang scheduling, where all instances of a job are scheduled simultaneously. Because MPI lacks inherent fault tolerance, it is common to abort the entire job (i.e. all instances) if a single instance fails. Furthermore, native MPI/HPC schedulers tend to support long queues with various fairness mechanisms in order to make gang scheduling efficient.
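To make the gang semantics above concrete, here is a minimal Python sketch of the bookkeeping involved. It is only an illustration under my own assumptions: the Gang class, its method names and the simple slot counting are hypothetical and are not part of MPI, Mesos or Aurora.

```python
from dataclasses import dataclass, field
from enum import Enum


class State(Enum):
    PENDING = "pending"
    RUNNING = "running"
    FAILED = "failed"
    KILLED = "killed"


@dataclass
class Gang:
    """Hypothetical bookkeeping for one MPI job with `size` ranks."""

    size: int
    states: dict = field(default_factory=dict)

    def __post_init__(self):
        # Every rank starts out waiting for resources.
        self.states = {rank: State.PENDING for rank in range(self.size)}

    def try_start(self, free_slots: int) -> bool:
        """Start all ranks at once, or none of them (gang scheduling)."""
        if free_slots < self.size:
            return False  # not enough room for the whole gang: keep waiting
        for rank in self.states:
            self.states[rank] = State.RUNNING
        return True

    def report_failure(self, rank: int) -> None:
        """A single failed rank aborts every rank that is still running."""
        self.states[rank] = State.FAILED
        for other, state in self.states.items():
            if state is State.RUNNING:
                self.states[other] = State.KILLED


if __name__ == "__main__":
    gang = Gang(size=4)
    print(gang.try_start(free_slots=3))  # too few slots, nothing starts
    print(gang.try_start(free_slots=4))  # all four ranks start together
    gang.report_failure(2)               # rank 2 dies, the rest are killed
    print(gang.states)
```

The point of the sketch is the two all-or-nothing rules: no rank starts unless every rank can start, and no rank keeps running once one of them has failed.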
In contrast, Aurora assumes that individual instances of a job can be scheduled, and can fail, independently. This implies that you would need some external scaffolding to ensure proper gang scheduling. (Disclaimer: I have no idea how difficult this would be.)

Aurora is battle-tested. Using it as the backend of an HPC/MPI scheduler could therefore be worthwhile if you manage to make the scaffolding work, in particular because writing a scalable and fault-tolerant Mesos framework can be quite difficult.

Best Regards,
Stephan

On Sat, 2016-10-15 at 12:47 -0400, Mangirish Wagle wrote:
> Hi Santhosh,
>
> Thanks for your response and suggestion. Mesos-hydra is not being used and
> supported by the community anymore, from what I heard from Mesos
> developers. But certainly it may be a potential reference to build up
> upon.
>
> My most preferred option would be to use any existing schedulers like
> Apache Aurora to run MPI. If you have any insights on that, that would be
> really helpful.
>
> Regards,
> Mangirish
>
> On Sat, Oct 15, 2016 at 11:07 AM, Santhosh Kumar Shanmugham <
> sshanmug...@twitter.com.invalid> wrote:
>
> > Have you checked out https://github.com/mesosphere/mesos-hydra?
> >
> > On Oct 14, 2016 6:08 PM, "Mangirish Wagle" <vaglomangir...@gmail.com>
> > wrote:
> >
> > > Thanks for your response Zameer. I shall check out Apache Aurora and
> > > update if it served the purpose.
> > >
> > > On Fri, Oct 14, 2016 at 2:01 PM, Zameer Manji <zma...@apache.org>
> > > wrote:
> > >
> > > > Hey,
> > > >
> > > > I am not an expert on MPI jobs, but it seems possible to run them on
> > > > Aurora. Aurora is a pretty flexible scheduler that lets you run
> > > > arbitrary binaries or container images. Aurora is designed for long
> > > > running services and assuming that you want to launch workers that
> > > > are long running, it could solve your problem.
> > > >
> > > > On Thu, Oct 13, 2016 at 11:12 PM, Mangirish Wagle <
> > > > vaglomangir...@gmail.com> wrote:
> > > >
> > > > > Hello Aurora Devs,
> > > > >
> > > > > I am contributing to Apache Airavata <http://airavata.apache.org/>
> > > > > and currently working on extending the support for the science
> > > > > gateways to run MPI jobs on cloud based Mesos clusters.
> > > > >
> > > > > Is there a way I can achieve this using Apache Aurora? I would really
> > > > > appreciate if you could share info on any work already being done to
> > > > > achieve scheduling MPI jobs on Mesos.
> > > > >
> > > > > Thank you.
> > > > >
> > > > > Best Regards,
> > > > > Mangirish Wagle
> > > > > Graduate Student, Indiana University Bloomington
> > > >
> > > > --
> > > > Zameer Manji