I used MPI briefly a couple of years ago, and from what I remember:

MPI tends to require so-called gang scheduling, where all instances of
a job are scheduled simultaneously. Because MPI lacks inherent fault
tolerance, it is common to abort the entire job (i.e. all instances) if
a single instance fails. Furthermore, native MPI/HPC schedulers tend to
support long queues with various fairness mechanisms in order to make
gang scheduling efficient.

In contrast, Aurora assumes that individual instances of a job can be
scheduled, and can fail, independently. This implies that you would
need some external scaffolding to ensure proper gang scheduling.
(Disclaimer: I have no idea how difficult this would be.)
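
To make the idea of scaffolding a bit more concrete, here is a rough
Python sketch of the bookkeeping such a layer might do. It is only an
illustration under my own assumptions: GangCoordinator, register and
report_failure are hypothetical names, not part of Aurora, Mesos, or any
MPI library, and how instances would actually reach the coordinator is
left open. The MPI job is released only once every instance has reported
in, and a single failure aborts the whole gang.

```python
from dataclasses import dataclass, field
from enum import Enum


class GangState(Enum):
    WAITING = "waiting"  # not all instances have been scheduled yet
    RUNNING = "running"  # every instance is up; the MPI job may start
    ABORTED = "aborted"  # one instance failed, so the whole job is torn down


@dataclass
class GangCoordinator:
    """Hypothetical coordinator; not an Aurora or Mesos API."""

    size: int                                  # number of instances in the gang
    hosts: dict = field(default_factory=dict)  # instance id -> hostname
    state: GangState = GangState.WAITING

    def register(self, instance_id: int, host: str) -> GangState:
        """Called by an instance once the scheduler has started it."""
        if self.state is GangState.ABORTED:
            return self.state  # late arrivals must not revive an aborted gang
        self.hosts[instance_id] = host
        if len(self.hosts) == self.size:
            self.state = GangState.RUNNING
        return self.state

    def report_failure(self, instance_id: int) -> GangState:
        """A single failed instance aborts all instances, as MPI expects."""
        self.hosts.clear()
        self.state = GangState.ABORTED
        return self.state

    def hostfile(self) -> str:
        """One hostname per line, ordered by instance id."""
        if self.state is not GangState.RUNNING:
            raise RuntimeError("gang is not fully scheduled")
        return "\n".join(self.hosts[i] for i in sorted(self.hosts))
```

In this sketch each instance would call register() on startup and block
until the state becomes RUNNING; only then would instance 0 start the
MPI launcher with the generated host list. The hard parts (timeouts,
restarting the whole gang, and making the coordinator itself fault
tolerant) are deliberately not covered.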

Aurora is battle-tested. Using it as the backend of an HPC/MPI
scheduler could therefore be worthwhile if you manage to make the
scaffolding work, in particular because writing a scalable and
fault-tolerant Mesos framework can be quite difficult.

Best Regards,
Stephan  


On Sat, 2016-10-15 at 12:47 -0400, Mangirish Wagle wrote:
> Hi Santhosh,
> 
> Thanks for your response and suggestion. Mesos-hydra is not being
> used and supported by the community anymore, from what I heard from
> Mesos developers. But certainly it may be a potential reference to
> build up upon.
> 
> My most preferred option would be to use any existing schedulers like
> Apache Aurora to run MPI. If you have any insights on that, that
> would be really helpful.
> 
> Regards,
> Mangirish
> 
> On Sat, Oct 15, 2016 at 11:07 AM, Santhosh Kumar Shanmugham <
> sshanmug...@twitter.com.invalid> wrote:
> 
> > 
> > Have you checked out https://github.com/mesosphere/mesos-hydra?
> > 
> > On Oct 14, 2016 6:08 PM, "Mangirish Wagle" <vaglomangir...@gmail.com>
> > wrote:
> > 
> > > 
> > > Thanks for your response Zameer. I shall check out Apache Aurora
> > > and update if it served the purpose.
> > > 
> > > On Fri, Oct 14, 2016 at 2:01 PM, Zameer Manji <zma...@apache.org>
> > > wrote:
> > > 
> > > > 
> > > > Hey,
> > > > 
> > > > I am not an expert on MPI jobs, but it seems possible to run
> > > > them on Aurora. Aurora is a pretty flexible scheduler that lets
> > > > you run arbitrary binaries or container images. Aurora is
> > > > designed for long running services and assuming that you want to
> > > > launch workers that are long running, it could solve your
> > > > problem.
> > > > 
> > > > On Thu, Oct 13, 2016 at 11:12 PM, Mangirish Wagle <
> > > > vaglomangir...@gmail.com>
> > > > wrote:
> > > > 
> > > > > 
> > > > > Hello Aurora Devs,
> > > > > 
> > > > > I am contributing to Apache Airavata
> > > > > <http://airavata.apache.org/> and currently working on
> > > > > extending the support for the science gateways to run MPI jobs
> > > > > on cloud based Mesos clusters.
> > > > > 
> > > > > Is there a way I can achieve this using Apache Aurora? I
> > > > > would really
> > > > > appreciate if you could share info on any work already being
> > > > > done to
> > > > > achieve scheduling MPI jobs on Mesos.
> > > > > 
> > > > > Thank you.
> > > > > 
> > > > > Best Regards,
> > > > > Mangirish Wagle
> > > > > Graduate Student, Indiana University Bloomington
> > > > > 
> > > > > --
> > > > > Zameer Manji
> > > > > 
> > > > 
> > > 
> > 
