Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel

On 09/05/14 00:16, Joshua Ladd wrote:

> The necessary packages will be supported and available in community
> OFED.

We're constrained to what is in RHEL6 I'm afraid.

This is because we have to run GPFS over IB to BG/Q from the same NSDs
that talk GPFS to all our Intel clusters.   We did try MOFED 2.x (in
connected mode) on a new Intel cluster during its bring up last year
which worked for MPI but stopped it talking to the NSDs.  Reverting to
vanilla RHEL6 fixed it.

Not your problem though. :-)  As Ralph has said there is work on an
alternative solution that we will be able to use.

Thanks!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel

On 08/05/14 23:45, Ralph Castain wrote:

> Artem and I are working on a new PMIx plugin that will resolve it 
> for non-Mellanox cases.

Ah yes of course, sorry my bad!

-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Joshua Ladd
Chris,

The necessary packages will be supported and available in community OFED.

Josh


On Thu, May 8, 2014 at 9:23 AM, Chris Samuel  wrote:

> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
>
> > We (MLNX) are working on a new SLURM PMI2 plugin that we plan to
> eventually
> > push upstream. However, to use it, it will require linking in a
> proprietary
> > Mellanox library that accelerates the collective operations (available in
> > MOFED versions 2.1 and higher.)
>
> What about those of us who cannot run Mellanox OFED?
>
> All the best,
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14755.php
>


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Ralph Castain

On May 8, 2014, at 6:23 AM, Chris Samuel  wrote:

> On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:
> 
>> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
>> push upstream. However, to use it, it will require linking in a proprietary
>> Mellanox library that accelerates the collective operations (available in
>> MOFED versions 2.1 and higher.)
> 
> What about those of us who cannot run Mellanox OFED?

Artem and I are working on a new PMIx plugin that will resolve it for 
non-Mellanox cases.

> 
> All the best,
> Chris
> -- 
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14755.php



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Chris Samuel
On Thu, 8 May 2014 09:10:00 AM Joshua Ladd wrote:

> We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
> push upstream. However, to use it, it will require linking in a proprietary
> Mellanox library that accelerates the collective operations (available in
> MOFED versions 2.1 and higher.)

What about those of us who cannot run Mellanox OFED?

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Joshua Ladd
Hi, Adam

We (MLNX) are working on a new SLURM PMI2 plugin that we plan to eventually
push upstream. However, to use it, it will require linking in a proprietary
Mellanox library that accelerates the collective operations (available in
MOFED versions 2.1 and higher.)  Similar in spirit to the MXM MTL or FCA
COLL components in OMPI.

Best,

Josh


On Wed, May 7, 2014 at 11:45 AM, Moody, Adam T. <mood...@llnl.gov> wrote:

>  Hi Josh,
> Are your changes to OMPI or SLURM's PMI2 implementation?  Do you plan to
> push those changes back upstream?
> -Adam
>
>
>  --
> *From:* devel [devel-boun...@open-mpi.org] on behalf of Joshua Ladd [
> jladd.m...@gmail.com]
> *Sent:* Wednesday, May 07, 2014 7:56 AM
> *To:* Open MPI Developers
>
> *Subject:* Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is
> specifically requested
>
>   Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think it
> would be good to maybe add a "--with-pmi2" flag to configure since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
>
>  Josh
>
>
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain <r...@open-mpi.org> wrote:
>
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>>  FWIW: I think people misunderstood my statement. I specifically did
>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>> stabilized, then we could reverse that policy.
>>
>>  However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
>>
>>
>>   On May 7, 2014, at 7:39 AM, Joshua Ladd <jladd.m...@gmail.com> wrote:
>>
>>  Just saw this thread, and I second Chris' observations: at scale we
>> are seeing huge gains in jobstart performance with PMI2 over PMI1. We
>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>> provide exact numbers, but let's say the difference is in the ballpark of a
>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
>> working to resolve some of the scalability issues in PMI2.
>>
>>  Josh
>>
>>  Joshua S. Ladd
>>  Staff Engineer, HPC Software
>>  Mellanox Technologies
>>
>>  Email: josh...@mellanox.com
>>
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain <r...@open-mpi.org> wrote:
>>
>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>> becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel <sam...@unimelb.edu.au>
>>> wrote:
>>>
>>> >
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you posted, Chris -
>>> >> those include time for MPI wireup. So what you are seeing is that
>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>> >> requires that everything be encoded into strings and sent in little
>>> >> pieces.
>>> >>
>>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>>> >> operation) much faster, and MPI_Init completes faster. Rest of the
>>> >> computation should be the same, so long compute apps will see the
>>> >> difference narrow considerably.
>>> >
>>> > Unfortunately it looks like I had an enthusiastic cleanup at some point

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Christopher Samuel

On 08/05/14 12:54, Ralph Castain wrote:

> I think there was one 2.6.x that was borked, and definitely
> problems in the 14.03.x line. Can't pinpoint it for you, though.

No worries, thanks.

> Sounds good. I'm going to have to dig deeper into those numbers, 
> though, as they don't entirely add up to me. Once the job gets 
> launched, the launch method itself should have no bearing on 
> computational speed - IF all things are equal. In other words, if
> the process layout is the same, and the binding pattern is the
> same, then computational speed should be roughly equivalent
> regardless of how the procs were started.

Not sure if it's significant, but when mpirun was launching the processes
it used srun to start orted, which then started the MPI ranks, whereas
with PMI/PMI2 srun appeared to start the ranks directly.

> My guess is that your data might indicate a difference in the
> layout and/or binding pattern as opposed to PMI2 vs mpirun. At the
> scale you mention later in the thread (only 70 nodes x 16 ppn), the
> difference in launch timing would be zilch. So I'm betting you
> would find (upon further exploration) that (a) you might not have
> been binding processes when launching by mpirun, since we didn't
> bind by default until the 1.8 series, but were binding under direct
> srun launch, and (b) your process mapping would quite likely be
> different as we default to byslot mapping, and I believe srun
> defaults to bynode?

FWIW all our environment modules that do OMPI have:

setenv OMPI_MCA_orte_process_binding core

> Might be worth another comparison run when someone has time.

Yeah, I'll try and queue up some more tests - unfortunately the
cluster we tested on then is flat out at the moment but I'll try and
sneak a 64-core job using identical configs and compare mpirun, srun
on its own and srun with PMI2.

All the best,
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-08 Thread Artem Polyakov
2014-05-08 9:54 GMT+07:00 Ralph Castain :

>
> On May 7, 2014, at 6:15 PM, Christopher Samuel 
> wrote:
>
> >
> > Hi all,
> >
> > Apologies for having dropped out of the thread, night intervened here.
> ;-)
> >
> > On 08/05/14 00:45, Ralph Castain wrote:
> >
> >> Okay, then we'll just have to develop a workaround for all those
> >> Slurm releases where PMI-2 is borked :-(
> >
> > Do you know what these releases are?  Are we talking 2.6.x or 14.03?
> > The 14.03 series has had a fair few rapid point releases and doesn't
> > appear to be anywhere as near as stable as 2.6 was when it came out. :-(
>
> Yeah :-(
>
> I think there was one 2.6.x that was borked, and definitely problems in
> the 14.03.x line. Can't pinpoint it for you, though.
>

The bug I experienced with abnormal OMPI termination persists from
2.6.3 up to the latest Slurm release. It may appear earlier - I didn't check.
However, the SLURM guys haven't confirmed that it's actually a bug. Things will
become clear in two weeks, when the person who maintains the code reviews the
patch. But I am pretty sure it is a bug.

Refer to this thread
http://thread.gmane.org/gmane.comp.distributed.slurm.devel/5213.



>
> >
> >> FWIW: I think people misunderstood my statement. I specifically
> >> did *not* propose to *lose* PMI-2 support. I suggested that we
> >> change it to "on-by-request" instead of the current "on-by-default"
> >> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> >> the Slurm implementation stabilized, then we could reverse that
> >> policy.
> >>
> >> However, given that both you and Chris appear to prefer to keep it
> >> "on-by-default", we'll see if we can find a way to detect that
> >> PMI-2 is broken and then fall back to PMI-1.
> >
> > My intention was to provide the data that led us to want PMI2, but if
> > configure had an option to enable PMI2 by default so that only those
> > who requested it got it then I'd be more than happy - we'd just add it
> > to our script to build it.
>
> Sounds good. I'm going to have to dig deeper into those numbers, though,
> as they don't entirely add up to me. Once the job gets launched, the launch
> method itself should have no bearing on computational speed - IF all things
> are equal. In other words, if the process layout is the same, and the
> binding pattern is the same, then computational speed should be roughly
> equivalent regardless of how the procs were started.
>
> My guess is that your data might indicate a difference in the layout
> and/or binding pattern as opposed to PMI2 vs mpirun. At the scale you
> mention later in the thread (only 70 nodes x 16 ppn), the difference in
> launch timing would be zilch. So I'm betting you would find (upon further
> exploration) that (a) you might not have been binding processes when
> launching by mpirun, since we didn't bind by default until the 1.8 series,
> but were binding under direct srun launch, and (b) your process mapping
> would quite likely be different as we default to byslot mapping, and I
> believe srun defaults to bynode?
>
> Might be worth another comparison run when someone has time.
>
>
> >
> > All the best!
> > Chris
> > --
> > Christopher Samuel    Senior Systems Administrator
> > VLSCI - Victorian Life Sciences Computation Initiative
> > Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> > http://www.vlsci.org.au/  http://twitter.com/vlsci
> >
> > ___
> > devel mailing list
> > de...@open-mpi.org
> > Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14733.php
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14738.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain

On May 7, 2014, at 6:15 PM, Christopher Samuel  wrote:

> 
> Hi all,
> 
> Apologies for having dropped out of the thread, night intervened here. ;-)
> 
> On 08/05/14 00:45, Ralph Castain wrote:
> 
>> Okay, then we'll just have to develop a workaround for all those 
>> Slurm releases where PMI-2 is borked :-(
> 
> Do you know what these releases are?  Are we talking 2.6.x or 14.03?
> The 14.03 series has had a fair few rapid point releases and doesn't
> appear to be anywhere as near as stable as 2.6 was when it came out. :-(

Yeah :-(

I think there was one 2.6.x that was borked, and definitely problems in the 
14.03.x line. Can't pinpoint it for you, though.

> 
>> FWIW: I think people misunderstood my statement. I specifically
>> did *not* propose to *lose* PMI-2 support. I suggested that we
>> change it to "on-by-request" instead of the current "on-by-default"
>> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
>> the Slurm implementation stabilized, then we could reverse that
>> policy.
>> 
>> However, given that both you and Chris appear to prefer to keep it 
>> "on-by-default", we'll see if we can find a way to detect that
>> PMI-2 is broken and then fall back to PMI-1.
> 
> My intention was to provide the data that led us to want PMI2, but if
> configure had an option to enable PMI2 by default so that only those
> who requested it got it then I'd be more than happy - we'd just add it
> to our script to build it.

Sounds good. I'm going to have to dig deeper into those numbers, though, as 
they don't entirely add up to me. Once the job gets launched, the launch method 
itself should have no bearing on computational speed - IF all things are equal. 
In other words, if the process layout is the same, and the binding pattern is 
the same, then computational speed should be roughly equivalent regardless of 
how the procs were started.

My guess is that your data might indicate a difference in the layout and/or 
binding pattern as opposed to PMI2 vs mpirun. At the scale you mention later in 
the thread (only 70 nodes x 16 ppn), the difference in launch timing would be 
zilch. So I'm betting you would find (upon further exploration) that (a) you 
might not have been binding processes when launching by mpirun, since we didn't 
bind by default until the 1.8 series, but were binding under direct srun 
launch, and (b) your process mapping would quite likely be different as we 
default to byslot mapping, and I believe srun defaults to bynode?

Might be worth another comparison run when someone has time.


> 
> All the best!
> Chris
> --
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14733.php



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain

On May 7, 2014, at 6:51 PM, Christopher Samuel  wrote:

> 
> On 07/05/14 18:00, Ralph Castain wrote:
> 
>> Interesting - how many nodes were involved? As I said, the bad 
>> scaling becomes more evident at a fairly high node count.
> 
> Our x86-64 systems are low node counts (we've got BG/Q for capacity),
> the cluster that those tests were run on has 70 nodes, each with 16
> cores, so I suspect we're a long long way away from that pain point.

At least 25x, my friend :-)


> 
> All the best!
> Chris
> --
> Christopher Samuel    Senior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14734.php



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
That is interesting. I think I will reproduce your experiments on my
system when I test the PMI selection logic. Judging by your resource
counts, I should be able to do that. I will publish my results to the list.


2014-05-08 8:51 GMT+07:00 Christopher Samuel :

>
> On 07/05/14 18:00, Ralph Castain wrote:
>
> > Interesting - how many nodes were involved? As I said, the bad
> > scaling becomes more evident at a fairly high node count.
>
> Our x86-64 systems are low node counts (we've got BG/Q for capacity),
> the cluster that those tests were run on has 70 nodes, each with 16
> cores, so I suspect we're a long long way away from that pain point.
>
> All the best!
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14734.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
Hi Chris.

The current design is to provide a runtime parameter for PMI version
selection. That would be even more flexible than configure-time selection
and (to my current understanding) not very hard to achieve.


2014-05-08 8:15 GMT+07:00 Christopher Samuel :

>
> Hi all,
>
> Apologies for having dropped out of the thread, night intervened here. ;-)
>
> On 08/05/14 00:45, Ralph Castain wrote:
>
> > Okay, then we'll just have to develop a workaround for all those
> > Slurm releases where PMI-2 is borked :-(
>
> Do you know what these releases are?  Are we talking 2.6.x or 14.03?
> The 14.03 series has had a fair few rapid point releases and doesn't
> appear to be anywhere as near as stable as 2.6 was when it came out. :-(
>
> > FWIW: I think people misunderstood my statement. I specifically
> > did *not* propose to *lose* PMI-2 support. I suggested that we
> > change it to "on-by-request" instead of the current "on-by-default"
> > so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> > the Slurm implementation stabilized, then we could reverse that
> > policy.
> >
> > However, given that both you and Chris appear to prefer to keep it
> > "on-by-default", we'll see if we can find a way to detect that
> > PMI-2 is broken and then fall back to PMI-1.
>
> My intention was to provide the data that led us to want PMI2, but if
> configure had an option to enable PMI2 by default so that only those
> who requested it got it then I'd be more than happy - we'd just add it
> to our script to build it.
>
> All the best!
> Chris
> --
>  Christopher Samuel    Senior Systems Administrator
>  VLSCI - Victorian Life Sciences Computation Initiative
>  Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>  http://www.vlsci.org.au/  http://twitter.com/vlsci
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14733.php
>



-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel

On 07/05/14 18:00, Ralph Castain wrote:

> Interesting - how many nodes were involved? As I said, the bad 
> scaling becomes more evident at a fairly high node count.

Our x86-64 systems are low node counts (we've got BG/Q for capacity),
the cluster that those tests were run on has 70 nodes, each with 16
cores, so I suspect we're a long long way away from that pain point.

All the best!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel

Hi all,

Apologies for having dropped out of the thread, night intervened here. ;-)

On 08/05/14 00:45, Ralph Castain wrote:

> Okay, then we'll just have to develop a workaround for all those 
> Slurm releases where PMI-2 is borked :-(

Do you know what these releases are?  Are we talking 2.6.x or 14.03?
The 14.03 series has had a fair few rapid point releases and doesn't
appear to be anywhere as near as stable as 2.6 was when it came out. :-(

> FWIW: I think people misunderstood my statement. I specifically
> did *not* propose to *lose* PMI-2 support. I suggested that we
> change it to "on-by-request" instead of the current "on-by-default"
> so we wouldn't keep getting asked about PMI-2 bugs in Slurm. Once
> the Slurm implementation stabilized, then we could reverse that
> policy.
> 
> However, given that both you and Chris appear to prefer to keep it 
> "on-by-default", we'll see if we can find a way to detect that
> PMI-2 is broken and then fall back to PMI-1.

My intention was to provide the data that led us to want PMI2, but if
configure had an option to enable PMI2 by default so that only those
who requested it got it then I'd be more than happy - we'd just add it
to our script to build it.

All the best!
Chris
-- 
 Christopher Samuel    Senior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
2014-05-08 7:15 GMT+07:00 Ralph Castain :

> Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2
> stuff in there. All we are talking about doing here is:
>
> * making those selections be runtime based on an MCA param, compiling if
> PMI2 is available but selected at runtime
>
> * moving some additional functions into that code area and out of the
> individual components
>

OK, that is pretty clear now. I will do exactly #2.
Thank you.


>
>
> On May 7, 2014, at 5:08 PM, Artem Polyakov  wrote:
>
> I like #2 too.
> But my question was slightly different. Can we encapsulate the PMI logic that
> OMPI uses in common/pmi, as #2 suggests, but have two different
> implementations of this component, say common/pmi and common/pmi2? I am
> asking because I have concerns that this kind of component is not supposed
> to be duplicated.
> In this case we could have one common MCA parameter and 2 components as it
> was suggested by Jeff.
>
>
> 2014-05-08 7:01 GMT+07:00 Ralph Castain :
>
>> The desired solution is to have the ability to select pmi-1 vs pmi-2 at
>> runtime. This can be done in two ways:
>>
>> 1. you could have separate pmi1 and pmi2 components in each framework.
>> You'd want to define only one common MCA param to direct the selection,
>> however.
>>
>> 2. you could have a single pmi component in each framework, calling code
>> in the appropriate common/pmi location. You would then need a runtime MCA
>> param to select whether pmi-1 or pmi-2 was going to be used, and have the
>> common code check before making the desired calls.
>>
>> The choice of method is left up to you. They each have their negatives.
>> If it were me, I'd probably try #2 first, assuming the codes are mostly
>> common in the individual frameworks.
>>
>>
>> On May 7, 2014, at 4:51 PM, Artem Polyakov  wrote:
>>
>>  Just reread your suggestions in our out-of-list discussion and found
>> that I misunderstand it. So no parallel PMI! Take all possible code into
>> opal/mca/common/pmi.
>> To additionally clarify what is the preferred way:
>> 1. to create one joined PMI module with switches to decide which
>> functionality to implement, or
>> 2. to have two separate common modules, one for PMI1 and one for PMI2 - and does
>> this fit the opal/mca/common/ ideology at all?
>>
>>
>> 2014-05-08 6:44 GMT+07:00 Artem Polyakov :
>>
>>>
>>> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
>>>
>>> Ummmno, I don't think that's right. I believe we decided to instead
 create the separate components, default to PMI-2 if available, print nice
 error message if not, otherwise use PMI-1.

 I don't want to initialize both PMIs in parallel as most installations
 won't support it.

>>>
>>> Ok, I agree. Beside the lack of support there can be a performance hit
>>> caused by PMI1 initialization at scale. This is not a case of SLURM PMI1
>>> since it is quite simple and local. But I didn't consider other
>>> implementations.
>>>
>>> On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:

 We discussed with Ralph Joshuas concerns and decided to try automatic
 PMI2 correctness first as it was initially intended. Here is my idea. The
 universal way to decide if PMI2 is correct is to compare PMI_Init(..,
 , , ...) and PMI2_Init(.., , , ...). Size and rank
 should be equal. In this case we proceed with PMI2 finalizing PMI1.
 Otherwise we finalize PMI2 and proceed with PMI1.
 I need to clarify with SLURM guys if parallel initialization of both
 PMIs are legal. If not - we'll do that sequentially.
 In other places we'll just use the flag saying what PMI version to use.
 Does that sounds reasonable?

 2014-05-07 23:10 GMT+07:00 Artem Polyakov :

> That's a good point. There is actually a bunch of modules in ompi,
> opal and orte that has to be duplicated.
>
> среда, 7 мая 2014 г. пользователь Joshua Ladd написал:
>
>> +1 Sounds like a good idea - but decoupling the two and adding all
>> the right selection mojo might be a bit of a pain. There are several 
>> places
>> in OMPI where the distinction between PMI1and PMI2 is made, not only in
>> grpcomm. DB and ESS frameworks off the top of my head.
>>
>> Josh
>>
>>
>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov 
>> wrote:
>>
>>> Good idea :)!
>>>
>>> среда, 7 мая 2014 г. пользователь Ralph Castain написал:
>>>
>>> Jeff actually had a useful suggestion (gasp!).He proposed that we
>>> separate the PMI-1 and PMI-2 codes into separate components so you could
>>> select them at runtime. Thus, we would build both (assuming both PMI-1 
>>> and
>>> 2 libs are found), default to PMI-1, but users could select to try 
>>> PMI-2.
>>> If the PMI-2 component failed, we would 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Take a look in opal/mca/common/pmi - we already do a bunch of #if PMI2 stuff in 
there. All we are talking about doing here is:

* making those selections be runtime based on an MCA param, compiling if PMI2 
is available but selected at runtime

* moving some additional functions into that code area and out of the 
individual components


On May 7, 2014, at 5:08 PM, Artem Polyakov  wrote:

> I like #2 too. 
> But my question was slightly different. Can we encapsulate the PMI logic that 
> OMPI uses in common/pmi, as #2 suggests, but have two different implementations of 
> this component, say common/pmi and common/pmi2? I am asking because I have 
> concerns that this kind of component is not supposed to be duplicated.
> In this case we could have one common MCA parameter and 2 components as it 
> was suggested by Jeff.
> 
> 
> 2014-05-08 7:01 GMT+07:00 Ralph Castain :
> The desired solution is to have the ability to select pmi-1 vs pmi-2 at 
> runtime. This can be done in two ways:
> 
> 1. you could have separate pmi1 and pmi2 components in each framework. You'd 
> want to define only one common MCA param to direct the selection, however.
> 
> 2. you could have a single pmi component in each framework, calling code in 
> the appropriate common/pmi location. You would then need a runtime MCA param 
> to select whether pmi-1 or pmi-2 was going to be used, and have the common 
> code check before making the desired calls.
> 
> The choice of method is left up to you. They each have their negatives. If it 
> were me, I'd probably try #2 first, assuming the codes are mostly common in 
> the individual frameworks.
> 
> 
> On May 7, 2014, at 4:51 PM, Artem Polyakov  wrote:
> 
>> Just reread your suggestions in our out-of-list discussion and found that I 
>> misunderstand it. So no parallel PMI! Take all possible code into 
>> opal/mca/common/pmi.
>> To additionally clarify what is the preferred way:
>> 1. to create one joined PMI module with switches to decide which
>> functionality to implement, or
>> 2. to have two separate common modules, one for PMI1 and one for PMI2 - and does 
>> this fit the opal/mca/common/ ideology at all?
>> 
>> 
>> 2014-05-08 6:44 GMT+07:00 Artem Polyakov :
>> 
>> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
>> 
>> Ummmno, I don't think that's right. I believe we decided to instead 
>> create the separate components, default to PMI-2 if available, print nice 
>> error message if not, otherwise use PMI-1.
>> 
>> I don't want to initialize both PMIs in parallel as most installations won't 
>> support it.
>> 
>> Ok, I agree. Beside the lack of support there can be a performance hit 
>> caused by PMI1 initialization at scale. This is not a case of SLURM PMI1 
>> since it is quite simple and local. But I didn't consider other 
>> implementations.
>> 
>> On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:
>> 
>>> We discussed with Ralph Joshuas concerns and decided to try automatic PMI2 
>>> correctness first as it was initially intended. Here is my idea. The 
>>> universal way to decide if PMI2 is correct is to compare PMI_Init(.., 
>>> , , ...) and PMI2_Init(.., , , ...). Size and rank 
>>> should be equal. In this case we proceed with PMI2 finalizing PMI1. 
>>> Otherwise we finalize PMI2 and proceed with PMI1.
>>> I need to clarify with SLURM guys if parallel initialization of both PMIs 
>>> are legal. If not - we'll do that sequentially. 
>>> In other places we'll just use the flag saying what PMI version to use.
>>> Does that sounds reasonable?
>>> 
>>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>>> That's a good point. There is actually a bunch of modules in ompi, opal and 
>>> orte that has to be duplicated.
>>> 
>>> среда, 7 мая 2014 г. пользователь Joshua Ladd написал:
>>> +1 Sounds like a good idea - but decoupling the two and adding all the 
>>> right selection mojo might be a bit of a pain. There are several places in 
>>> OMPI where the distinction between PMI1and PMI2 is made, not only in 
>>> grpcomm. DB and ESS frameworks off the top of my head.
>>> 
>>> Josh
>>> 
>>> 
>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov  wrote:
>>> Good idea :)!
>>> 
>>> среда, 7 мая 2014 г. пользователь Ralph Castain написал:
>>> 
>>> Jeff actually had a useful suggestion (gasp!).He proposed that we separate 
>>> the PMI-1 and PMI-2 codes into separate components so you could select them 
>>> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are 
>>> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 
>>> component failed, we would emit a show_help indicating that they probably 
>>> have a broken PMI-2 version and should try PMI-1.
>>> 
>>> Make sense?
>>> Ralph
>>> 
>>> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>>> 
 
 On May 7, 2014, at 7:56 AM, Joshua Ladd 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
I like #2 too.
But my question was slightly different. Can we encapsulate the PMI logic that
OMPI uses in common/pmi, as #2 suggests, but have two different implementations
of this component, say common/pmi and common/pmi2? I am asking because I
have concerns that this kind of component is not supposed to be duplicated.
In this case we could have one common MCA parameter and 2 components as it
was suggested by Jeff.


2014-05-08 7:01 GMT+07:00 Ralph Castain :

> The desired solution is to have the ability to select pmi-1 vs pmi-2 at
> runtime. This can be done in two ways:
>
> 1. you could have separate pmi1 and pmi2 components in each framework.
> You'd want to define only one common MCA param to direct the selection,
> however.
>
> 2. you could have a single pmi component in each framework, calling code
> in the appropriate common/pmi location. You would then need a runtime MCA
> param to select whether pmi-1 or pmi-2 was going to be used, and have the
> common code check before making the desired calls.
>
> The choice of method is left up to you. They each have their negatives. If
> it were me, I'd probably try #2 first, assuming the codes are mostly common
> in the individual frameworks.
>
>
> On May 7, 2014, at 4:51 PM, Artem Polyakov  wrote:
>
> Just reread your suggestions in our out-of-list discussion and found that
> I misunderstand it. So no parallel PMI! Take all possible code into
> opal/mca/common/pmi.
> To additionally clarify what is the preferred way:
> 1. to create one joined PMI module with switches to decide which
> functionality to implement, or
> 2. to have two separate common modules, one for PMI1 and one for PMI2 - and
> does this fit the opal/mca/common/ ideology at all?
>
>
> 2014-05-08 6:44 GMT+07:00 Artem Polyakov :
>
>>
>> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
>>
>> Ummmno, I don't think that's right. I believe we decided to instead
>>> create the separate components, default to PMI-2 if available, print nice
>>> error message if not, otherwise use PMI-1.
>>>
>>> I don't want to initialize both PMIs in parallel as most installations
>>> won't support it.
>>>
>>
>> Ok, I agree. Beside the lack of support there can be a performance hit
>> caused by PMI1 initialization at scale. This is not a case of SLURM PMI1
>> since it is quite simple and local. But I didn't consider other
>> implementations.
>>
>> On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:
>>>
>>> We discussed with Ralph Joshuas concerns and decided to try automatic
>>> PMI2 correctness first as it was initially intended. Here is my idea. The
>>> universal way to decide if PMI2 is correct is to compare PMI_Init(..,
>>> , , ...) and PMI2_Init(.., , , ...). Size and rank
>>> should be equal. In this case we proceed with PMI2 finalizing PMI1.
>>> Otherwise we finalize PMI2 and proceed with PMI1.
>>> I need to clarify with SLURM guys if parallel initialization of both
>>> PMIs are legal. If not - we'll do that sequentially.
>>> In other places we'll just use the flag saying what PMI version to use.
>>> Does that sounds reasonable?
>>>
>>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>>>
 That's a good point. There is actually a bunch of modules in ompi, opal
 and orte that has to be duplicated.

 среда, 7 мая 2014 г. пользователь Joshua Ladd написал:

> +1 Sounds like a good idea - but decoupling the two and adding all the
> right selection mojo might be a bit of a pain. There are several places in
> OMPI where the distinction between PMI1and PMI2 is made, not only in
> grpcomm. DB and ESS frameworks off the top of my head.
>
> Josh
>
>
> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov 
> wrote:
>
>> Good idea :)!
>>
>> среда, 7 мая 2014 г. пользователь Ralph Castain написал:
>>
>> Jeff actually had a useful suggestion (gasp!).He proposed that we
>> separate the PMI-1 and PMI-2 codes into separate components so you could
>> select them at runtime. Thus, we would build both (assuming both PMI-1 
>> and
>> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
>> If the PMI-2 component failed, we would emit a show_help indicating that
>> they probably have a broken PMI-2 version and should try PMI-1.
>>
>> Make sense?
>> Ralph
>>
>> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>>
>>
>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>>
>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>> squarely within my "jurisdiction", and we've invested a lot in improving
>> OMPI jobstart under srun.
>>
>> That being said (now that I've taken some deep breaths and carefully
>> read your original email :)), what you're proposing isn't a bad idea. I
>> think 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
The desired solution is to have the ability to select pmi-1 vs pmi-2 at 
runtime. This can be done in two ways:

1. you could have separate pmi1 and pmi2 components in each framework. You'd 
want to define only one common MCA param to direct the selection, however.

2. you could have a single pmi component in each framework, calling code in the 
appropriate common/pmi location. You would then need a runtime MCA param to 
select whether pmi-1 or pmi-2 was going to be used, and have the common code 
check before making the desired calls.

The choice of method is left up to you. They each have their negatives. If it 
were me, I'd probably try #2 first, assuming the codes are mostly common in the 
individual frameworks.
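
For reference, a minimal sketch of what option #2 could look like, assuming Slurm's
PMI-1/PMI-2 client headers (pmi.h / pmi2.h). The "use_pmi2" flag and the function
name below are hypothetical stand-ins for a runtime MCA parameter and the
opal/mca/common/pmi wrapper - this is a sketch, not the actual Open MPI code:

#include <stdbool.h>
#include <pmi.h>     /* Slurm PMI-1 client API */
#include <pmi2.h>    /* Slurm PMI-2 client API */

static bool use_pmi2 = false;   /* would be set from a runtime MCA param */

/* One common entry point that checks the runtime flag before calling
 * into PMI-1 or PMI-2 (option #2 above). */
static int common_pmi_init(int *rank, int *size)
{
    int spawned = 0, appnum = 0;

    if (use_pmi2) {
        /* PMI-2 hands back size and rank in a single call */
        return (PMI2_SUCCESS == PMI2_Init(&spawned, size, rank, &appnum)) ? 0 : -1;
    }

    /* PMI-1 path: init, then query rank and size separately */
    if (PMI_SUCCESS != PMI_Init(&spawned) ||
        PMI_SUCCESS != PMI_Get_rank(rank) ||
        PMI_SUCCESS != PMI_Get_size(size)) {
        return -1;
    }
    return 0;
}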


On May 7, 2014, at 4:51 PM, Artem Polyakov  wrote:

> Just reread your suggestions in our out-of-list discussion and found that I 
> misunderstand it. So no parallel PMI! Take all possible code into 
> opal/mca/common/pmi.
> To additionally clarify what is the preferred way:
> 1. to create one joined PMI module with switches to decide which
> functionality to implement, or
> 2. to have two separate common modules, one for PMI1 and one for PMI2 - and does 
> this fit the opal/mca/common/ ideology at all?
> 
> 
> 2014-05-08 6:44 GMT+07:00 Artem Polyakov :
> 
> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
> 
> Ummmno, I don't think that's right. I believe we decided to instead 
> create the separate components, default to PMI-2 if available, print nice 
> error message if not, otherwise use PMI-1.
> 
> I don't want to initialize both PMIs in parallel as most installations won't 
> support it.
> 
> Ok, I agree. Beside the lack of support there can be a performance hit caused 
> by PMI1 initialization at scale. This is not a case of SLURM PMI1 since it is 
> quite simple and local. But I didn't consider other implementations.
> 
> On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:
> 
>> We discussed with Ralph Joshuas concerns and decided to try automatic PMI2 
>> correctness first as it was initially intended. Here is my idea. The 
>> universal way to decide if PMI2 is correct is to compare PMI_Init(.., , 
>> , ...) and PMI2_Init(.., , , ...). Size and rank should be 
>> equal. In this case we proceed with PMI2 finalizing PMI1. Otherwise we 
>> finalize PMI2 and proceed with PMI1.
>> I need to clarify with SLURM guys if parallel initialization of both PMIs 
>> are legal. If not - we'll do that sequentially. 
>> In other places we'll just use the flag saying what PMI version to use.
>> Does that sounds reasonable?
>> 
>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>> That's a good point. There is actually a bunch of modules in ompi, opal and 
>> orte that has to be duplicated.
>> 
>> среда, 7 мая 2014 г. пользователь Joshua Ladd написал:
>> +1 Sounds like a good idea - but decoupling the two and adding all the right 
>> selection mojo might be a bit of a pain. There are several places in OMPI 
>> where the distinction between PMI1and PMI2 is made, not only in grpcomm. DB 
>> and ESS frameworks off the top of my head.
>> 
>> Josh
>> 
>> 
>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov  wrote:
>> Good idea :)!
>> 
>> среда, 7 мая 2014 г. пользователь Ralph Castain написал:
>> 
>> Jeff actually had a useful suggestion (gasp!).He proposed that we separate 
>> the PMI-1 and PMI-2 codes into separate components so you could select them 
>> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are 
>> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 
>> component failed, we would emit a show_help indicating that they probably 
>> have a broken PMI-2 version and should try PMI-1.
>> 
>> Make sense?
>> Ralph
>> 
>> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>> 
>>> 
>>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>>> 
 Ah, I see. Sorry for the reactionary comment - but this feature falls 
 squarely within my "jurisdiction", and we've invested a lot in improving 
 OMPI jobstart under srun. 
 
 That being said (now that I've taken some deep breaths and carefully read 
 your original email :)), what you're proposing isn't a bad idea. I think 
 it would be good to maybe add a "--with-pmi2" flag to configure since 
 "--with-pmi" automagically uses PMI2 if it finds the header and lib. This 
 way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or 
 hack the installation. 
>>> 
>>> That would be a much simpler solution than what Artem proposed (off-list) 
>>> where we would try PMI2 and then if it didn't work try to figure out how to 
>>> fall back to PMI1. I'll add this for now, and if Artem wants to try his 
>>> more automagic solution and can make it work, then we can reconsider that 
>>> option.
>>> 
>>> Thanks
>>> Ralph
>>> 
 
 Josh  

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
I just reread your suggestions from our out-of-list discussion and found that I
had misunderstood them. So no parallel PMI! Take all possible code into
opal/mca/common/pmi.

To additionally clarify, which is the preferred way:
1. to create one joined PMI module with switches to decide which
functionality to implement, or
2. to have two separate common modules, one for PMI1 and one for PMI2 - and does
this fit the opal/mca/common/ ideology at all?


2014-05-08 6:44 GMT+07:00 Artem Polyakov :

>
> 2014-05-08 5:54 GMT+07:00 Ralph Castain :
>
> Ummmno, I don't think that's right. I believe we decided to instead
>> create the separate components, default to PMI-2 if available, print nice
>> error message if not, otherwise use PMI-1.
>>
>> I don't want to initialize both PMIs in parallel as most installations
>> won't support it.
>>
>
> Ok, I agree. Beside the lack of support there can be a performance hit
> caused by PMI1 initialization at scale. This is not a case of SLURM PMI1
> since it is quite simple and local. But I didn't consider other
> implementations.
>
> On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:
>>
>> We discussed with Ralph Joshuas concerns and decided to try automatic
>> PMI2 correctness first as it was initially intended. Here is my idea. The
>> universal way to decide if PMI2 is correct is to compare PMI_Init(..,
>> , , ...) and PMI2_Init(.., , , ...). Size and rank
>> should be equal. In this case we proceed with PMI2 finalizing PMI1.
>> Otherwise we finalize PMI2 and proceed with PMI1.
>> I need to clarify with SLURM guys if parallel initialization of both PMIs
>> are legal. If not - we'll do that sequentially.
>> In other places we'll just use the flag saying what PMI version to use.
>> Does that sounds reasonable?
>>
>> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>>
>>> That's a good point. There is actually a bunch of modules in ompi, opal
>>> and orte that has to be duplicated.
>>>
>>> среда, 7 мая 2014 г. пользователь Joshua Ladd написал:
>>>
 +1 Sounds like a good idea - but decoupling the two and adding all the
 right selection mojo might be a bit of a pain. There are several places in
 OMPI where the distinction between PMI1and PMI2 is made, not only in
 grpcomm. DB and ESS frameworks off the top of my head.

 Josh


 On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov 
 wrote:

> Good idea :)!
>
> среда, 7 мая 2014 г. пользователь Ralph Castain написал:
>
> Jeff actually had a useful suggestion (gasp!).He proposed that we
> separate the PMI-1 and PMI-2 codes into separate components so you could
> select them at runtime. Thus, we would build both (assuming both PMI-1 and
> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
> If the PMI-2 component failed, we would emit a show_help indicating that
> they probably have a broken PMI-2 version and should try PMI-1.
>
> Make sense?
> Ralph
>
> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>
>
> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully
> read your original email :)), what you're proposing isn't a bad idea. I
> think it would be good to maybe add a "--with-pmi2" flag to configure 
> since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
>
>
> That would be a much simpler solution than what Artem proposed
> (off-list) where we would try PMI2 and then if it didn't work try to 
> figure
> out how to fall back to PMI1. I'll add this for now, and if Artem wants to
> try his more automagic solution and can make it work, then we can
> reconsider that option.
>
> Thanks
> Ralph
>
>
> Josh
>
>
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain 
> wrote:
>
> Okay, then we'll just have to develop a workaround for all those Slurm
> releases where PMI-2 is borked :-(
>
> FWIW: I think people misunderstood my statement. I specifically did
> *not* propose to *lose* PMI-2 support. I suggested that we change it to
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
> stabilized, then we could reverse that policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
2014-05-08 5:54 GMT+07:00 Ralph Castain :

> Ummmno, I don't think that's right. I believe we decided to instead
> create the separate components, default to PMI-2 if available, print nice
> error message if not, otherwise use PMI-1.
>
> I don't want to initialize both PMIs in parallel as most installations
> won't support it.
>

OK, I agree. Besides the lack of support, there can be a performance hit
caused by PMI1 initialization at scale. That is not the case for SLURM's PMI1,
since it is quite simple and local, but I hadn't considered other
implementations.

On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:
>
> Ralph and I discussed Joshua's concerns and decided to try the automatic PMI2
> correctness check first, as was initially intended. Here is my idea. The
> universal way to decide if PMI2 is correct is to compare PMI_Init(..,
> , , ...) and PMI2_Init(.., , , ...). Size and rank
> should be equal. In that case we proceed with PMI2, finalizing PMI1.
> Otherwise we finalize PMI2 and proceed with PMI1.
> I need to clarify with the SLURM guys whether parallel initialization of both PMIs
> is legal. If not, we'll do it sequentially.
> In other places we'll just use a flag saying which PMI version to use.
> Does that sound reasonable?
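
For what it's worth, here is a rough sketch of the check described above, assuming
the arguments elided by the archive are the spawned/size/rank outputs of the Slurm
PMI/PMI2 client calls (in PMI-1 the rank and size come from separate
PMI_Get_rank/PMI_Get_size calls). The function name is made up, and whether bringing
up both PMI layers back to back like this is legal is exactly the open question
raised here:

#include <pmi.h>
#include <pmi2.h>

/* Returns 1 if PMI-2 should be used (PMI-1 is finalized), 0 to stay on PMI-1. */
static int pmi2_looks_sane(void)
{
    int spawned1 = 0, rank1 = -1, size1 = -1;
    int spawned2 = 0, rank2 = -1, size2 = -1, appnum = 0;

    /* Bring up PMI-1 first and remember its view of the job. */
    if (PMI_SUCCESS != PMI_Init(&spawned1) ||
        PMI_SUCCESS != PMI_Get_rank(&rank1) ||
        PMI_SUCCESS != PMI_Get_size(&size1)) {
        return 0;
    }

    /* Try PMI-2; if it cannot even initialize, stay on PMI-1. */
    if (PMI2_SUCCESS != PMI2_Init(&spawned2, &size2, &rank2, &appnum)) {
        return 0;
    }

    if (rank1 == rank2 && size1 == size2) {
        PMI_Finalize();   /* the two views agree: drop PMI-1, proceed with PMI-2 */
        return 1;
    }

    PMI2_Finalize();      /* views disagree: assume PMI-2 is broken, keep PMI-1 */
    return 0;
}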
>
> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
>
>> That's a good point. There is actually a bunch of modules in ompi, opal
>> and orte that has to be duplicated.
>>
>> среда, 7 мая 2014 г. пользователь Joshua Ladd написал:
>>
>>> +1 Sounds like a good idea - but decoupling the two and adding all the
>>> right selection mojo might be a bit of a pain. There are several places in
>>> OMPI where the distinction between PMI1and PMI2 is made, not only in
>>> grpcomm. DB and ESS frameworks off the top of my head.
>>>
>>> Josh
>>>
>>>
>>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov 
>>> wrote:
>>>
 Good idea :)!

 среда, 7 мая 2014 г. пользователь Ralph Castain написал:

 Jeff actually had a useful suggestion (gasp!).He proposed that we
 separate the PMI-1 and PMI-2 codes into separate components so you could
 select them at runtime. Thus, we would build both (assuming both PMI-1 and
 2 libs are found), default to PMI-1, but users could select to try PMI-2.
 If the PMI-2 component failed, we would emit a show_help indicating that
 they probably have a broken PMI-2 version and should try PMI-1.

 Make sense?
 Ralph

 On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:


 On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:

 Ah, I see. Sorry for the reactionary comment - but this feature falls
 squarely within my "jurisdiction", and we've invested a lot in improving
 OMPI jobstart under srun.

 That being said (now that I've taken some deep breaths and carefully
 read your original email :)), what you're proposing isn't a bad idea. I
 think it would be good to maybe add a "--with-pmi2" flag to configure since
 "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
 way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
 hack the installation.


 That would be a much simpler solution than what Artem proposed
 (off-list) where we would try PMI2 and then if it didn't work try to figure
 out how to fall back to PMI1. I'll add this for now, and if Artem wants to
 try his more automagic solution and can make it work, then we can
 reconsider that option.

 Thanks
 Ralph


 Josh


 On Wed, May 7, 2014 at 10:45 AM, Ralph Castain 
 wrote:

 Okay, then we'll just have to develop a workaround for all those Slurm
 releases where PMI-2 is borked :-(

 FWIW: I think people misunderstood my statement. I specifically did
 *not* propose to *lose* PMI-2 support. I suggested that we change it to
 "on-by-request" instead of the current "on-by-default" so we wouldn't keep
 getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
 stabilized, then we could reverse that policy.

 However, given that both you and Chris appear to prefer to keep it
 "on-by-default", we'll see if we can find a way to detect that PMI-2 is
 broken and then fall back to PMI-1.


 On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:

 Just saw this thread, and I second Chris' observations: at scale we are
 seeing huge gains in jobstart performance with PMI2 over PMI1. We
 *CANNOT* lose this functionality. For competitive reasons, I cannot
 provide exact numbers, but let's say the difference is in the ballpark of a
 full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
 unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
 but there 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Ummm... no, I don't think that's right. I believe we decided to instead create
the separate components, default to PMI-2 if available, print a nice error
message if not, and otherwise use PMI-1.

I don't want to initialize both PMIs in parallel as most installations won't 
support it.
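
In other words, the intended selection order is roughly the sketch below. The
names are made up purely for illustration (they are not real Open MPI symbols);
it only shows the policy, not the actual component code:

/* Hypothetical sketch: two separate PMI components, prefer PMI-2 when it
 * initializes cleanly, otherwise print a hint and fall back to PMI-1. */
#include <stdio.h>
#include <stdbool.h>

/* Stand-ins for "was the component built and does it initialize cleanly". */
static bool pmi2_component_init(void) { return false; /* e.g. broken SLURM PMI-2 */ }
static bool pmi1_component_init(void) { return true; }

static int select_pmi(void)
{
    if (pmi2_component_init())
        return 2;                          /* default: PMI-2 if available */

    fprintf(stderr,
            "PMI-2 failed to initialize; the SLURM PMI-2 library may be broken.\n"
            "Falling back to PMI-1.\n");

    return pmi1_component_init() ? 1 : 0;  /* otherwise use PMI-1 */
}

int main(void)
{
    printf("selected PMI-%d\n", select_pmi());
    return 0;
}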


On May 7, 2014, at 3:49 PM, Artem Polyakov  wrote:

> Ralph and I discussed Joshua's concerns and decided to try automatic PMI2
> correctness detection first, as initially intended. Here is my idea. The
> universal way to decide whether PMI2 is correct is to compare the rank and
> size returned by PMI_Init(.., <rank>, <size>, ...) and PMI2_Init(.., <rank>,
> <size>, ...). Size and rank should be equal. In that case we proceed with
> PMI2 and finalize PMI1; otherwise we finalize PMI2 and proceed with PMI1.
> I need to clarify with the SLURM guys whether parallel initialization of
> both PMIs is legal. If not, we'll do it sequentially.
> In other places we'll just use a flag saying which PMI version to use.
> Does that sound reasonable?
> 
> 2014-05-07 23:10 GMT+07:00 Artem Polyakov :
> That's a good point. There is actually a bunch of modules in ompi, opal and 
> orte that has to be duplicated.
> 
> On Wednesday, May 7, 2014, Joshua Ladd wrote:
> +1 Sounds like a good idea - but decoupling the two and adding all the right 
> selection mojo might be a bit of a pain. There are several places in OMPI 
> where the distinction between PMI1 and PMI2 is made, not only in grpcomm. DB 
> and ESS frameworks off the top of my head.
> 
> Josh
> 
> 
> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov  wrote:
> Good idea :)!
> 
> On Wednesday, May 7, 2014, Ralph Castain wrote:
> 
> Jeff actually had a useful suggestion (gasp!). He proposed that we separate 
> the PMI-1 and PMI-2 codes into separate components so you could select them 
> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are 
> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 
> component failed, we would emit a show_help indicating that they probably 
> have a broken PMI-2 version and should try PMI-1.
> 
> Make sense?
> Ralph
> 
> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
> 
>> 
>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>> 
>>> Ah, I see. Sorry for the reactionary comment - but this feature falls 
>>> squarely within my "jurisdiction", and we've invested a lot in improving 
>>> OMPI jobstart under srun. 
>>> 
>>> That being said (now that I've taken some deep breaths and carefully read 
>>> your original email :)), what you're proposing isn't a bad idea. I think it 
>>> would be good to maybe add a "--with-pmi2" flag to configure since 
>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This 
>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or 
>>> hack the installation. 
>> 
>> That would be a much simpler solution than what Artem proposed (off-list) 
>> where we would try PMI2 and then if it didn't work try to figure out how to 
>> fall back to PMI1. I'll add this for now, and if Artem wants to try his more 
>> automagic solution and can make it work, then we can reconsider that option.
>> 
>> Thanks
>> Ralph
>> 
>>> 
>>> Josh  
>>> 
>>> 
>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>>> Okay, then we'll just have to develop a workaround for all those Slurm 
>>> releases where PMI-2 is borked :-(
>>> 
>>> FWIW: I think people misunderstood my statement. I specifically did *not* 
>>> propose to *lose* PMI-2 support. I suggested that we change it to 
>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep 
>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
>>> stabilized, then we could reverse that policy.
>>> 
>>> However, given that both you and Chris appear to prefer to keep it 
>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is 
>>> broken and then fall back to PMI-1.
>>> 
>>> 
>>> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>>> 
 Just saw this thread, and I second Chris' observations: at scale we are 
 seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT 
 lose this functionality. For competitive reasons, I cannot provide exact 
 numbers, but let's say the difference is in the ballpark of a full 
 order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely 
 unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, 
 but there is no contest between PMI1 and PMI2.  We (MLNX) are actively 
 working to resolve some of the scalability issues in PMI2. 
 
 Josh
 
 Joshua S. Ladd
 Staff Engineer, HPC Software
 Mellanox Technologies
 
 Email: josh...@mellanox.com
 
 
 On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
 Interesting - how many 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
Ralph and I discussed Joshua's concerns and decided to try automatic PMI2
correctness detection first, as initially intended. Here is my idea. The
universal way to decide whether PMI2 is correct is to compare the rank and
size returned by PMI_Init(.., <rank>, <size>, ...) and PMI2_Init(.., <rank>,
<size>, ...). Size and rank should be equal. In that case we proceed with
PMI2 and finalize PMI1. Otherwise we finalize PMI2 and proceed with PMI1.
I need to clarify with the SLURM guys whether parallel initialization of
both PMIs is legal. If not, we'll do it sequentially.
In other places we'll just use a flag saying which PMI version to use.
Does that sound reasonable?
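
A minimal sketch of that cross-check, assuming the stock SLURM pmi.h/pmi2.h
interfaces; whether both libraries really tolerate being initialized in the
same process is exactly the open question above, so treat this as pseudocode
for the policy rather than finished code:

#include <slurm/pmi.h>
#include <slurm/pmi2.h>

/* Returns 2 if PMI-2 reports the same rank/size as PMI-1, 1 to fall back to
 * PMI-1, and 0 if neither interface works. */
static int pick_pmi_version(void)
{
    int spawned, size, rank, appnum;
    int size1, rank1, rc2;

    if (PMI_Init(&spawned) != PMI_SUCCESS ||
        PMI_Get_size(&size1) != PMI_SUCCESS ||
        PMI_Get_rank(&rank1) != PMI_SUCCESS)
        return 0;                       /* no usable PMI at all */

    rc2 = PMI2_Init(&spawned, &size, &rank, &appnum);
    if (rc2 == PMI2_SUCCESS && size == size1 && rank == rank1) {
        PMI_Finalize();                 /* PMI-2 looks sane: drop PMI-1 */
        return 2;
    }

    if (rc2 == PMI2_SUCCESS)
        PMI2_Finalize();                /* inconsistent answers: back out of PMI-2 */
    return 1;                           /* proceed with PMI-1 */
}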

2014-05-07 23:10 GMT+07:00 Artem Polyakov :

> That's a good point. There is actually a bunch of modules in ompi, opal
> and orte that has to be duplicated.
>
> On Wednesday, May 7, 2014, Joshua Ladd wrote:
>
>>  +1 Sounds like a good idea - but decoupling the two and adding all the
>> right selection mojo might be a bit of a pain. There are several places in
>> OMPI where the distinction between PMI1 and PMI2 is made, not only in
>> grpcomm. DB and ESS frameworks off the top of my head.
>>
>> Josh
>>
>>
>> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov wrote:
>>
>>> Good idea :)!
>>>
>>> On Wednesday, May 7, 2014, Ralph Castain wrote:
>>>
>>> Jeff actually had a useful suggestion (gasp!). He proposed that we
>>> separate the PMI-1 and PMI-2 codes into separate components so you could
>>> select them at runtime. Thus, we would build both (assuming both PMI-1 and
>>> 2 libs are found), default to PMI-1, but users could select to try PMI-2.
>>> If the PMI-2 component failed, we would emit a show_help indicating that
>>> they probably have a broken PMI-2 version and should try PMI-1.
>>>
>>> Make sense?
>>> Ralph
>>>
>>> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>>>
>>>
>>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>>>
>>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>>> squarely within my "jurisdiction", and we've invested a lot in improving
>>> OMPI jobstart under srun.
>>>
>>> That being said (now that I've taken some deep breaths and carefully
>>> read your original email :)), what you're proposing isn't a bad idea. I
>>> think it would be good to maybe add a "--with-pmi2" flag to configure since
>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>>> hack the installation.
>>>
>>>
>>> That would be a much simpler solution than what Artem proposed
>>> (off-list) where we would try PMI2 and then if it didn't work try to figure
>>> out how to fall back to PMI1. I'll add this for now, and if Artem wants to
>>> try his more automagic solution and can make it work, then we can
>>> reconsider that option.
>>>
>>> Thanks
>>> Ralph
>>>
>>>
>>> Josh
>>>
>>>
>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>>>
>>> Okay, then we'll just have to develop a workaround for all those Slurm
>>> releases where PMI-2 is borked :-(
>>>
>>> FWIW: I think people misunderstood my statement. I specifically did
>>> *not* propose to *lose* PMI-2 support. I suggested that we change it to
>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>>> stabilized, then we could reverse that policy.
>>>
>>> However, given that both you and Chris appear to prefer to keep it
>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>>> broken and then fall back to PMI-1.
>>>
>>>
>>> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>>>
>>> Just saw this thread, and I second Chris' observations: at scale we are
>>> seeing huge gains in jobstart performance with PMI2 over PMI1. We
>>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>>> provide exact numbers, but let's say the difference is in the ballpark of a
>>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>>> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
>>> working to resolve some of the scalability issues in PMI2.
>>>
>>> Josh
>>>
>>> Joshua S. Ladd
>>> Staff Engineer, HPC Software
>>> Mellanox Technologies
>>>
>>> Email: josh...@mellanox.com
>>>
>>>
>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>>>
>>> Interesting - how many nodes were involved? As I said, the bad scaling
>>> becomes more evident at a fairly high node count.
>>>
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel 
>>> wrote:
>>>
>>> > -BEGIN PGP SIGNED MESSAGE-
>>> > Hash: SHA1
>>> >
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Yeah, we'll want to move some of it into common - but a lot of that was already 
done, so I think it won't be that hard. Will explore


On May 7, 2014, at 9:00 AM, Joshua Ladd  wrote:

> +1 Sounds like a good idea - but decoupling the two and adding all the right 
> selection mojo might be a bit of a pain. There are several places in OMPI 
> where the distinction between PMI1 and PMI2 is made, not only in grpcomm. DB 
> and ESS frameworks off the top of my head.
> 
> Josh
> 
> 
> On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov  wrote:
> Good idea :)!
> 
> On Wednesday, May 7, 2014, Ralph Castain wrote:
> 
> Jeff actually had a useful suggestion (gasp!). He proposed that we separate 
> the PMI-1 and PMI-2 codes into separate components so you could select them 
> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are 
> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2 
> component failed, we would emit a show_help indicating that they probably 
> have a broken PMI-2 version and should try PMI-1.
> 
> Make sense?
> Ralph
> 
> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
> 
>> 
>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>> 
>>> Ah, I see. Sorry for the reactionary comment - but this feature falls 
>>> squarely within my "jurisdiction", and we've invested a lot in improving 
>>> OMPI jobstart under srun. 
>>> 
>>> That being said (now that I've taken some deep breaths and carefully read 
>>> your original email :)), what you're proposing isn't a bad idea. I think it 
>>> would be good to maybe add a "--with-pmi2" flag to configure since 
>>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This 
>>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or 
>>> hack the installation. 
>> 
>> That would be a much simpler solution than what Artem proposed (off-list) 
>> where we would try PMI2 and then if it didn't work try to figure out how to 
>> fall back to PMI1. I'll add this for now, and if Artem wants to try his more 
>> automagic solution and can make it work, then we can reconsider that option.
>> 
>> Thanks
>> Ralph
>> 
>>> 
>>> Josh  
>>> 
>>> 
>>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>>> Okay, then we'll just have to develop a workaround for all those Slurm 
>>> releases where PMI-2 is borked :-(
>>> 
>>> FWIW: I think people misunderstood my statement. I specifically did *not* 
>>> propose to *lose* PMI-2 support. I suggested that we change it to 
>>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep 
>>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
>>> stabilized, then we could reverse that policy.
>>> 
>>> However, given that both you and Chris appear to prefer to keep it 
>>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is 
>>> broken and then fall back to PMI-1.
>>> 
>>> 
>>> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>>> 
 Just saw this thread, and I second Chris' observations: at scale we are 
 seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT 
 lose this functionality. For competitive reasons, I cannot provide exact 
 numbers, but let's say the difference is in the ballpark of a full 
 order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely 
 unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, 
 but there is no contest between PMI1 and PMI2.  We (MLNX) are actively 
 working to resolve some of the scalability issues in PMI2. 
 
 Josh
 
 Joshua S. Ladd
 Staff Engineer, HPC Software
 Mellanox Technologies
 
 Email: josh...@mellanox.com
 
 
 On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
 Interesting - how many nodes were involved? As I said, the bad scaling 
 becomes more evident at a fairly high node count.
 
 On May 7, 2014, at 12:07 AM, Christopher Samuel  
 wrote:
 
 > -BEGIN PGP SIGNED MESSAGE-
 > Hash: SHA1
 >
 > Hiya Ralph,
 >
 > On 07/05/14 14:49, Ralph Castain wrote:
 >
 >> I should have looked closer to see the numbers you posted, Chris -
 >> those include time for MPI wireup. So what you are seeing is that
 >> mpirun is much more efficient at exchanging the MPI endpoint info
 >> than PMI. I suspect that PMI2 is not much better as the primary
 >> reason for the difference is that mpirun sends blobs, while PMI
 >> requires that everything b
> 
> 
> -- 
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
> 
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Joshua Ladd
+1 Sounds like a good idea - but decoupling the two and adding all the
right selection mojo might be a bit of a pain. There are several places in
OMPI where the distinction between PMI1 and PMI2 is made, not only in
grpcomm. DB and ESS frameworks off the top of my head.

Josh


On Wed, May 7, 2014 at 11:48 AM, Artem Polyakov  wrote:

> Good idea :)!
>
> On Wednesday, May 7, 2014, Ralph Castain wrote:
>
> Jeff actually had a useful suggestion (gasp!). He proposed that we separate
>> the PMI-1 and PMI-2 codes into separate components so you could select them
>> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are
>> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2
>> component failed, we would emit a show_help indicating that they probably
>> have a broken PMI-2 version and should try PMI-1.
>>
>> Make sense?
>> Ralph
>>
>> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>>
>>
>> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>>
>> Ah, I see. Sorry for the reactionary comment - but this feature falls
>> squarely within my "jurisdiction", and we've invested a lot in improving
>> OMPI jobstart under srun.
>>
>> That being said (now that I've taken some deep breaths and carefully read
>> your original email :)), what you're proposing isn't a bad idea. I think it
>> would be good to maybe add a "--with-pmi2" flag to configure since
>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
>> hack the installation.
>>
>>
>> That would be a much simpler solution than what Artem proposed (off-list)
>> where we would try PMI2 and then if it didn't work try to figure out how to
>> fall back to PMI1. I'll add this for now, and if Artem wants to try his
>> more automagic solution and can make it work, then we can reconsider that
>> option.
>>
>> Thanks
>> Ralph
>>
>>
>> Josh
>>
>>
>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>>
>> Okay, then we'll just have to develop a workaround for all those Slurm
>> releases where PMI-2 is borked :-(
>>
>> FWIW: I think people misunderstood my statement. I specifically did *not*
>> propose to *lose* PMI-2 support. I suggested that we change it to
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
>> stabilized, then we could reverse that policy.
>>
>> However, given that both you and Chris appear to prefer to keep it
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
>> broken and then fall back to PMI-1.
>>
>>
>> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>>
>> Just saw this thread, and I second Chris' observations: at scale we are
>> seeing huge gains in jobstart performance with PMI2 over PMI1. We
>> *CANNOT* lose this functionality. For competitive reasons, I cannot
>> provide exact numbers, but let's say the difference is in the ballpark of a
>> full order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
>> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
>> working to resolve some of the scalability issues in PMI2.
>>
>> Josh
>>
>> Joshua S. Ladd
>> Staff Engineer, HPC Software
>> Mellanox Technologies
>>
>> Email: josh...@mellanox.com
>>
>>
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>>
>> Interesting - how many nodes were involved? As I said, the bad scaling
>> becomes more evident at a fairly high node count.
>>
>> On May 7, 2014, at 12:07 AM, Christopher Samuel 
>> wrote:
>>
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA1
>> >
>> > Hiya Ralph,
>> >
>> > On 07/05/14 14:49, Ralph Castain wrote:
>> >
>> >> I should have looked closer to see the numbers you posted, Chris -
>> >> those include time for MPI wireup. So what you are seeing is that
>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>> >> than PMI. I suspect that PMI2 is not much better as the primary
>> >> reason for the difference is that mpirun sends blobs, while PMI
>> >> requires that everything b
>>
>>
>
> --
> С Уважением, Поляков Артем Юрьевич
> Best regards, Artem Y. Polyakov
>
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post:
> http://www.open-mpi.org/community/lists/devel/2014/05/14716.php
>


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Artem Polyakov
Good idea :)!

On Wednesday, May 7, 2014, Ralph Castain wrote:

> Jeff actually had a useful suggestion (gasp!). He proposed that we separate
> the PMI-1 and PMI-2 codes into separate components so you could select them
> at runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are
> found), default to PMI-1, but users could select to try PMI-2. If the PMI-2
> component failed, we would emit a show_help indicating that they probably
> have a broken PMI-2 version and should try PMI-1.
>
> Make sense?
> Ralph
>
> On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:
>
>
> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
>
> Ah, I see. Sorry for the reactionary comment - but this feature falls
> squarely within my "jurisdiction", and we've invested a lot in improving
> OMPI jobstart under srun.
>
> That being said (now that I've taken some deep breaths and carefully read
> your original email :)), what you're proposing isn't a bad idea. I think it
> would be good to maybe add a "--with-pmi2" flag to configure since
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
> hack the installation.
>
>
> That would be a much simpler solution than what Artem proposed (off-list)
> where we would try PMI2 and then if it didn't work try to figure out how to
> fall back to PMI1. I'll add this for now, and if Artem wants to try his
> more automagic solution and can make it work, then we can reconsider that
> option.
>
> Thanks
> Ralph
>
>
> Josh
>
>
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>
> Okay, then we'll just have to develop a workaround for all those Slurm
> releases where PMI-2 is borked :-(
>
> FWIW: I think people misunderstood my statement. I specifically did *not*
> propose to *lose* PMI-2 support. I suggested that we change it to
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
> stabilized, then we could reverse that policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
> broken and then fall back to PMI-1.
>
>
> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>
> Just saw this thread, and I second Chris' observations: at scale we are
> seeing huge gains in jobstart performance with PMI2 over PMI1. We 
> *CANNOT* lose this functionality. For competitive reasons, I cannot provide 
> exact
> numbers, but let's say the difference is in the ballpark of a full
> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
> working to resolve some of the scalability issues in PMI2.
>
> Josh
>
> Joshua S. Ladd
> Staff Engineer, HPC Software
> Mellanox Technologies
>
> Email: josh...@mellanox.com
>
>
> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>
> Interesting - how many nodes were involved? As I said, the bad scaling
> becomes more evident at a fairly high node count.
>
> On May 7, 2014, at 12:07 AM, Christopher Samuel 
> wrote:
>
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > Hiya Ralph,
> >
> > On 07/05/14 14:49, Ralph Castain wrote:
> >
> >> I should have looked closer to see the numbers you posted, Chris -
> >> those include time for MPI wireup. So what you are seeing is that
> >> mpirun is much more efficient at exchanging the MPI endpoint info
> >> than PMI. I suspect that PMI2 is not much better as the primary
> >> reason for the difference is that mpirun sends blobs, while PMI
> >> requires that everything b
>
>

-- 
С Уважением, Поляков Артем Юрьевич
Best regards, Artem Y. Polyakov


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Jeff actually had a useful suggestion (gasp!). He proposed that we separate the 
PMI-1 and PMI-2 codes into separate components so you could select them at 
runtime. Thus, we would build both (assuming both PMI-1 and 2 libs are found), 
default to PMI-1, but users could select to try PMI-2. If the PMI-2 component 
failed, we would emit a show_help indicating that they probably have a broken 
PMI-2 version and should try PMI-1.

Make sense?
Ralph

On May 7, 2014, at 8:00 AM, Ralph Castain  wrote:

> 
> On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:
> 
>> Ah, I see. Sorry for the reactionary comment - but this feature falls 
>> squarely within my "jurisdiction", and we've invested a lot in improving 
>> OMPI jobstart under srun. 
>> 
>> That being said (now that I've taken some deep breaths and carefully read 
>> your original email :)), what you're proposing isn't a bad idea. I think it 
>> would be good to maybe add a "--with-pmi2" flag to configure since 
>> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This 
>> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or 
>> hack the installation. 
> 
> That would be a much simpler solution than what Artem proposed (off-list) 
> where we would try PMI2 and then if it didn't work try to figure out how to 
> fall back to PMI1. I'll add this for now, and if Artem wants to try his more 
> automagic solution and can make it work, then we can reconsider that option.
> 
> Thanks
> Ralph
> 
>> 
>> Josh  
>> 
>> 
>> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
>> Okay, then we'll just have to develop a workaround for all those Slurm 
>> releases where PMI-2 is borked :-(
>> 
>> FWIW: I think people misunderstood my statement. I specifically did *not* 
>> propose to *lose* PMI-2 support. I suggested that we change it to 
>> "on-by-request" instead of the current "on-by-default" so we wouldn't keep 
>> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
>> stabilized, then we could reverse that policy.
>> 
>> However, given that both you and Chris appear to prefer to keep it 
>> "on-by-default", we'll see if we can find a way to detect that PMI-2 is 
>> broken and then fall back to PMI-1.
>> 
>> 
>> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>> 
>>> Just saw this thread, and I second Chris' observations: at scale we are 
>>> seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT 
>>> lose this functionality. For competitive reasons, I cannot provide exact 
>>> numbers, but let's say the difference is in the ballpark of a full 
>>> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely 
>>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, 
>>> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively 
>>> working to resolve some of the scalability issues in PMI2. 
>>> 
>>> Josh
>>> 
>>> Joshua S. Ladd
>>> Staff Engineer, HPC Software
>>> Mellanox Technologies
>>> 
>>> Email: josh...@mellanox.com
>>> 
>>> 
>>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>>> Interesting - how many nodes were involved? As I said, the bad scaling 
>>> becomes more evident at a fairly high node count.
>>> 
>>> On May 7, 2014, at 12:07 AM, Christopher Samuel  
>>> wrote:
>>> 
>>> > -BEGIN PGP SIGNED MESSAGE-
>>> > Hash: SHA1
>>> >
>>> > Hiya Ralph,
>>> >
>>> > On 07/05/14 14:49, Ralph Castain wrote:
>>> >
>>> >> I should have looked closer to see the numbers you posted, Chris -
>>> >> those include time for MPI wireup. So what you are seeing is that
>>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>>> >> than PMI. I suspect that PMI2 is not much better as the primary
>>> >> reason for the difference is that mpirun sends blobs, while PMI
>>> >> requires that everything be encoded into strings and sent in little
>>> >> pieces.
>>> >>
>>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>>> >> operation) much faster, and MPI_Init completes faster. Rest of the
>>> >> computation should be the same, so long compute apps will see the
>>> >> difference narrow considerably.
>>> >
>>> > Unfortunately it looks like I had an enthusiastic cleanup at some point
>>> > and so I cannot find the out files from those runs at the moment, but
>>> > I did find some comparisons from around that time.
>>> >
>>> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
>>> > run with mpirun and srun successively from inside the same Slurm job.
>>> >
>>> > mpirun namd2 macpf.conf
>>> > srun --mpi=pmi2 namd2 macpf.conf
>>> >
>>> > Firstly the mpirun output (grep'ing the interesting bits):
>>> >
>>> > Charm++> Running on MPI version: 2.1
>>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 
>>> > MB memory
>>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Moody, Adam T.
Hi Josh,
Are your changes to OMPI or SLURM's PMI2 implementation?  Do you plan to push 
those changes back upstream?
-Adam



From: devel [devel-boun...@open-mpi.org] on behalf of Joshua Ladd 
[jladd.m...@gmail.com]
Sent: Wednesday, May 07, 2014 7:56 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is 
specifically requested

Ah, I see. Sorry for the reactionary comment - but this feature falls squarely 
within my "jurisdiction", and we've invested a lot in improving OMPI jobstart 
under srun.

That being said (now that I've taken some deep breaths and carefully read your 
original email :)), what you're proposing isn't a bad idea. I think it would be 
good to maybe add a "--with-pmi2" flag to configure since "--with-pmi" 
automagically uses PMI2 if it finds the header and lib. This way, we could 
experiment with PMI1/PMI2 without having to rebuild SLURM or hack the 
installation.

Josh


On Wed, May 7, 2014 at 10:45 AM, Ralph Castain 
<r...@open-mpi.org> wrote:
Okay, then we'll just have to develop a workaround for all those Slurm releases 
where PMI-2 is borked :-(

FWIW: I think people misunderstood my statement. I specifically did *not* 
propose to *lose* PMI-2 support. I suggested that we change it to 
"on-by-request" instead of the current "on-by-default" so we wouldn't keep 
getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
stabilized, then we could reverse that policy.

However, given that both you and Chris appear to prefer to keep it 
"on-by-default", we'll see if we can find a way to detect that PMI-2 is broken 
and then fall back to PMI-1.


On May 7, 2014, at 7:39 AM, Joshua Ladd 
<jladd.m...@gmail.com> wrote:

Just saw this thread, and I second Chris' observations: at scale we are seeing 
huge gains in jobstart performance with PMI2 over PMI1. We CANNOT lose this 
functionality. For competitive reasons, I cannot provide exact numbers, but 
let's say the difference is in the ballpark of a full order-of-magnitude on 20K 
ranks versus PMI1. PMI1 is completely unacceptable/unusable at scale. Certainly 
PMI2 still has scaling issues, but there is no contest between PMI1 and PMI2.  
We (MLNX) are actively working to resolve some of the scalability issues in 
PMI2.

Josh

Joshua S. Ladd
Staff Engineer, HPC Software
Mellanox Technologies

Email: josh...@mellanox.com


On Wed, May 7, 2014 at 4:00 AM, Ralph Castain 
<r...@open-mpi.org> wrote:
Interesting - how many nodes were involved? As I said, the bad scaling becomes 
more evident at a fairly high node count.

On May 7, 2014, at 12:07 AM, Christopher Samuel 
<sam...@unimelb.edu.au> wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
>
> Hiya Ralph,
>
> On 07/05/14 14:49, Ralph Castain wrote:
>
>> I should have looked closer to see the numbers you posted, Chris -
>> those include time for MPI wireup. So what you are seeing is that
>> mpirun is much more efficient at exchanging the MPI endpoint info
>> than PMI. I suspect that PMI2 is not much better as the primary
>> reason for the difference is that mpirun sends blobs, while PMI
>> requires that everything be encoded into strings and sent in little
>> pieces.
>>
>> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>> operation) much faster, and MPI_Init completes faster. Rest of the
>> computation should be the same, so long compute apps will see the
>> difference narrow considerably.
>
> Unfortunately it looks like I had an enthusiastic cleanup at some point
> and so I cannot find the out files from those runs at the moment, but
> I did find some comparisons from around that time.
>
> This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
> run with mpirun and srun successively from inside the same Slurm job.
>
> mpirun namd2 macpf.conf
> srun --mpi=pmi2 namd2 macpf.conf
>
> Firstly the mpirun output (grep'ing the interesting bits):
>
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB 
> memory
> WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
>
> Now the srun output:
>
> Charm++> Running o

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Moody, Adam T.
Thanks, Chris.
-Adam

From: devel [devel-boun...@open-mpi.org] on behalf of Christopher Samuel 
[sam...@unimelb.edu.au]
Sent: Wednesday, May 07, 2014 12:07 AM
To: de...@open-mpi.org
Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is 
specifically requested

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hiya Ralph,

On 07/05/14 14:49, Ralph Castain wrote:

> I should have looked closer to see the numbers you posted, Chris -
> those include time for MPI wireup. So what you are seeing is that
> mpirun is much more efficient at exchanging the MPI endpoint info
> than PMI. I suspect that PMI2 is not much better as the primary
> reason for the difference is that mpirun sends blobs, while PMI
> requires that everything be encoded into strings and sent in little
> pieces.
>
> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> operation) much faster, and MPI_Init completes faster. Rest of the
> computation should be the same, so long compute apps will see the
> difference narrow considerably.
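
For reference, the "little pieces" above are PMI-1 KVS puts: each process has
to string-encode its binary endpoint data and publish it key by key before the
others can read it back. A rough sketch against the plain PMI-1 API (the key
name and buffer sizes are illustrative, not Open MPI's actual modex keys):

#include <stdio.h>
#include <slurm/pmi.h>

/* Hex-encode a binary endpoint blob and put it under a per-rank key.  Real
 * code must split the blob across several keys once it exceeds the KVS
 * value-length limit, which is exactly the "little pieces" problem. */
static int publish_endpoint(const unsigned char *blob, size_t len, int rank)
{
    char kvsname[256], key[64], value[1024];
    size_t i;

    if (PMI_KVS_Get_my_name(kvsname, (int) sizeof(kvsname)) != PMI_SUCCESS)
        return -1;

    value[0] = '\0';
    for (i = 0; i < len && 2 * i + 2 < sizeof(value); i++)
        sprintf(value + 2 * i, "%02x", blob[i]);   /* 2 chars per byte */

    snprintf(key, sizeof(key), "endpoint-%d", rank);

    if (PMI_KVS_Put(kvsname, key, value) != PMI_SUCCESS ||
        PMI_KVS_Commit(kvsname) != PMI_SUCCESS)
        return -1;

    /* After the barrier every peer can PMI_KVS_Get() the pieces back. */
    return (PMI_Barrier() == PMI_SUCCESS) ? 0 : -1;
}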

Unfortunately it looks like I had an enthusiastic cleanup at some point
and so I cannot find the out files from those runs at the moment, but
I did find some comparisons from around that time.

This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
run with mpirun and srun successively from inside the same Slurm job.

mpirun namd2 macpf.conf
srun --mpi=pmi2 namd2 macpf.conf

Firstly the mpirun output (grep'ing the interesting bits):

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB 
memory
WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB

Now the srun output:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB 
memory
WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB


The next two pairs are first launched using mpirun from 1.6.x and then with srun
from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the same 
inputs.

First pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB

First pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
WallClock: 7476.643555  CPUTime: 7476.643555  Memory: 968.867188 MB


Second pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
WallClock: 7842.831543  CPUTime: 7842.831543  Memory: 1004.050781 MB

Second pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
WallClock: 7522.677246  CPUTime: 7522.677246  Memory: 969.433594 MB


So to me it looks like (for NAMD on our system at least) that

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain

On May 7, 2014, at 7:56 AM, Joshua Ladd  wrote:

> Ah, I see. Sorry for the reactionary comment - but this feature falls 
> squarely within my "jurisdiction", and we've invested a lot in improving OMPI 
> jobstart under srun. 
> 
> That being said (now that I've taken some deep breaths and carefully read 
> your original email :)), what you're proposing isn't a bad idea. I think it 
> would be good to maybe add a "--with-pmi2" flag to configure since 
> "--with-pmi" automagically uses PMI2 if it finds the header and lib. This 
> way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or 
> hack the installation. 

That would be a much simpler solution than what Artem proposed (off-list) where 
we would try PMI2 and then if it didn't work try to figure out how to fall back 
to PMI1. I'll add this for now, and if Artem wants to try his more automagic 
solution and can make it work, then we can reconsider that option.

Thanks
Ralph

> 
> Josh  
> 
> 
> On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:
> Okay, then we'll just have to develop a workaround for all those Slurm 
> releases where PMI-2 is borked :-(
> 
> FWIW: I think people misunderstood my statement. I specifically did *not* 
> propose to *lose* PMI-2 support. I suggested that we change it to 
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep 
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
> stabilized, then we could reverse that policy.
> 
> However, given that both you and Chris appear to prefer to keep it 
> "on-by-default", we'll see if we can find a way to detect that PMI-2 is 
> broken and then fall back to PMI-1.
> 
> 
> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
> 
>> Just saw this thread, and I second Chris' observations: at scale we are 
>> seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT 
>> lose this functionality. For competitive reasons, I cannot provide exact 
>> numbers, but let's say the difference is in the ballpark of a full 
>> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely 
>> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but 
>> there is no contest between PMI1 and PMI2.  We (MLNX) are actively working 
>> to resolve some of the scalability issues in PMI2. 
>> 
>> Josh
>> 
>> Joshua S. Ladd
>> Staff Engineer, HPC Software
>> Mellanox Technologies
>> 
>> Email: josh...@mellanox.com
>> 
>> 
>> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>> Interesting - how many nodes were involved? As I said, the bad scaling 
>> becomes more evident at a fairly high node count.
>> 
>> On May 7, 2014, at 12:07 AM, Christopher Samuel  
>> wrote:
>> 
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA1
>> >
>> > Hiya Ralph,
>> >
>> > On 07/05/14 14:49, Ralph Castain wrote:
>> >
>> >> I should have looked closer to see the numbers you posted, Chris -
>> >> those include time for MPI wireup. So what you are seeing is that
>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>> >> than PMI. I suspect that PMI2 is not much better as the primary
>> >> reason for the difference is that mpirun sends blobs, while PMI
>> >> requires that everything be encoded into strings and sent in little
>> >> pieces.
>> >>
>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>> >> operation) much faster, and MPI_Init completes faster. Rest of the
>> >> computation should be the same, so long compute apps will see the
>> >> difference narrow considerably.
>> >
>> > Unfortunately it looks like I had an enthusiastic cleanup at some point
>> > and so I cannot find the out files from those runs at the moment, but
>> > I did find some comparisons from around that time.
>> >
>> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
>> > run with mpirun and srun successively from inside the same Slurm job.
>> >
>> > mpirun namd2 macpf.conf
>> > srun --mpi=pmi2 namd2 macpf.conf
>> >
>> > Firstly the mpirun output (grep'ing the interesting bits):
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 
>> > MB memory
>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 
>> > MB memory
>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 
>> > MB memory
>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 
>> > MB memory
>> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 
>> > MB memory
>> > WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
>> >
>> > Now the srun output:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 
>> > MB memory
>> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Joshua Ladd
Ah, I see. Sorry for the reactionary comment - but this feature falls
squarely within my "jurisdiction", and we've invested a lot in improving
OMPI jobstart under srun.

That being said (now that I've taken some deep breaths and carefully read
your original email :)), what you're proposing isn't a bad idea. I think it
would be good to maybe add a "--with-pmi2" flag to configure since
"--with-pmi" automagically uses PMI2 if it finds the header and lib. This
way, we could experiment with PMI1/PMI2 without having to rebuild SLURM or
hack the installation.
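
A sketch of what the split might look like at build time, assuming a
hypothetical WANT_PMI2 define produced by the proposed "--with-pmi2" switch
(neither the flag nor the macro exists today):

/* WANT_PMI2 is an assumed configure-time define, not an existing OMPI macro. */
#ifdef WANT_PMI2
#include <slurm/pmi2.h>    /* opt in to the PMI-2 code path */
#else
#include <slurm/pmi.h>     /* default to plain PMI-1 */
#endif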

Josh


On Wed, May 7, 2014 at 10:45 AM, Ralph Castain  wrote:

> Okay, then we'll just have to develop a workaround for all those Slurm
> releases where PMI-2 is borked :-(
>
> FWIW: I think people misunderstood my statement. I specifically did *not*
> propose to *lose* PMI-2 support. I suggested that we change it to
> "on-by-request" instead of the current "on-by-default" so we wouldn't keep
> getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation
> stabilized, then we could reverse that policy.
>
> However, given that both you and Chris appear to prefer to keep it
> "on-by-default", we'll see if we can find a way to detect that PMI-2 is
> broken and then fall back to PMI-1.
>
>
> On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:
>
> Just saw this thread, and I second Chris' observations: at scale we are
> seeing huge gains in jobstart performance with PMI2 over PMI1. We 
> *CANNOT* lose this functionality. For competitive reasons, I cannot provide 
> exact
> numbers, but let's say the difference is in the ballpark of a full
> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
> but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
> working to resolve some of the scalability issues in PMI2.
>
> Josh
>
> Joshua S. Ladd
> Staff Engineer, HPC Software
> Mellanox Technologies
>
> Email: josh...@mellanox.com
>
>
> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
>
>> Interesting - how many nodes were involved? As I said, the bad scaling
>> becomes more evident at a fairly high node count.
>>
>> On May 7, 2014, at 12:07 AM, Christopher Samuel 
>> wrote:
>>
>> > -BEGIN PGP SIGNED MESSAGE-
>> > Hash: SHA1
>> >
>> > Hiya Ralph,
>> >
>> > On 07/05/14 14:49, Ralph Castain wrote:
>> >
>> >> I should have looked closer to see the numbers you posted, Chris -
>> >> those include time for MPI wireup. So what you are seeing is that
>> >> mpirun is much more efficient at exchanging the MPI endpoint info
>> >> than PMI. I suspect that PMI2 is not much better as the primary
>> >> reason for the difference is that mpirun sends blobs, while PMI
>> >> requires that everything be encoded into strings and sent in little
>> >> pieces.
>> >>
>> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>> >> operation) much faster, and MPI_Init completes faster. Rest of the
>> >> computation should be the same, so long compute apps will see the
>> >> difference narrow considerably.
>> >
>> > Unfortunately it looks like I had an enthusiastic cleanup at some point
>> > and so I cannot find the out files from those runs at the moment, but
>> > I did find some comparisons from around that time.
>> >
>> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
>> > run with mpirun and srun successively from inside the same Slurm job.
>> >
>> > mpirun namd2 macpf.conf
>> > srun --mpi=pmi2 namd2 macpf.conf
>> >
>> > Firstly the mpirun output (grep'ing the interesting bits):
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns
>> 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns
>> 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns
>> 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns
>> 1055.19 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns
>> 1055.19 MB memory
>> > WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
>> >
>> > Now the srun output:
>> >
>> > Charm++> Running on MPI version: 2.1
>> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns
>> 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns
>> 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns
>> 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns
>> 1036.75 MB memory
>> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns
>> 1036.75 MB memory
>> > WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB
>> >
>> >
>> > The next two pairs are first launched using mpirun from 1.6.x and then
>> with srun
>> > from 1.7.3a1r29103.  

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Okay, then we'll just have to develop a workaround for all those Slurm releases 
where PMI-2 is borked :-(

FWIW: I think people misunderstood my statement. I specifically did *not* 
propose to *lose* PMI-2 support. I suggested that we change it to 
"on-by-request" instead of the current "on-by-default" so we wouldn't keep 
getting asked about PMI-2 bugs in Slurm. Once the Slurm implementation 
stabilized, then we could reverse that policy.

However, given that both you and Chris appear to prefer to keep it 
"on-by-default", we'll see if we can find a way to detect that PMI-2 is broken 
and then fall back to PMI-1.


On May 7, 2014, at 7:39 AM, Joshua Ladd  wrote:

> Just saw this thread, and I second Chris' observations: at scale we are 
> seeing huge gains in jobstart performance with PMI2 over PMI1. We CANNOT 
> lose this functionality. For competitive reasons, I cannot provide exact 
> numbers, but let's say the difference is in the ballpark of a full 
> order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely 
> unacceptable/unusable at scale. Certainly PMI2 still has scaling issues, but 
> there is no contest between PMI1 and PMI2.  We (MLNX) are actively working to 
> resolve some of the scalability issues in PMI2. 
> 
> Josh
> 
> Joshua S. Ladd
> Staff Engineer, HPC Software
> Mellanox Technologies
> 
> Email: josh...@mellanox.com
> 
> 
> On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:
> Interesting - how many nodes were involved? As I said, the bad scaling 
> becomes more evident at a fairly high node count.
> 
> On May 7, 2014, at 12:07 AM, Christopher Samuel  wrote:
> 
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > Hiya Ralph,
> >
> > On 07/05/14 14:49, Ralph Castain wrote:
> >
> >> I should have looked closer to see the numbers you posted, Chris -
> >> those include time for MPI wireup. So what you are seeing is that
> >> mpirun is much more efficient at exchanging the MPI endpoint info
> >> than PMI. I suspect that PMI2 is not much better as the primary
> >> reason for the difference is that mpirun sends blobs, while PMI
> >> requires that everything be encoded into strings and sent in little
> >> pieces.
> >>
> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> >> operation) much faster, and MPI_Init completes faster. Rest of the
> >> computation should be the same, so long compute apps will see the
> >> difference narrow considerably.
> >
> > Unfortunately it looks like I had an enthusiastic cleanup at some point
> > and so I cannot find the out files from those runs at the moment, but
> > I did find some comparisons from around that time.
> >
> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
> > run with mpirun and srun successively from inside the same Slurm job.
> >
> > mpirun namd2 macpf.conf
> > srun --mpi=pmi2 namd2 macpf.conf
> >
> > Firstly the mpirun output (grep'ing the interesting bits):
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB 
> > memory
> > WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
> >
> > Now the srun output:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB 
> > memory
> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB 
> > memory
> > WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB
> >
> >
> > The next two pairs are first launched using mpirun from 1.6.x and then with 
> > srun
> > from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the 
> > same inputs.
> >
> > First pair mpirun:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB 
> > memory
> > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB 
> > memory
> > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB 
> > memory
> > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB 
> > memory
> > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB 
> > memory
> > WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB
> >
> > First pair srun:
> >
> > 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Joshua Ladd
Just saw this thread, and I second Chris' observations: at scale we are
seeing huge gains in jobstart performance with PMI2 over PMI1. We
*CANNOT* lose this functionality. For competitive reasons, I cannot
provide exact
numbers, but let's say the difference is in the ballpark of a full
order-of-magnitude on 20K ranks versus PMI1. PMI1 is completely
unacceptable/unusable at scale. Certainly PMI2 still has scaling issues,
but there is no contest between PMI1 and PMI2.  We (MLNX) are actively
working to resolve some of the scalability issues in PMI2.

Josh

Joshua S. Ladd
Staff Engineer, HPC Software
Mellanox Technologies

Email: josh...@mellanox.com


On Wed, May 7, 2014 at 4:00 AM, Ralph Castain  wrote:

> Interesting - how many nodes were involved? As I said, the bad scaling
> becomes more evident at a fairly high node count.
>
> On May 7, 2014, at 12:07 AM, Christopher Samuel 
> wrote:
>
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA1
> >
> > Hiya Ralph,
> >
> > On 07/05/14 14:49, Ralph Castain wrote:
> >
> >> I should have looked closer to see the numbers you posted, Chris -
> >> those include time for MPI wireup. So what you are seeing is that
> >> mpirun is much more efficient at exchanging the MPI endpoint info
> >> than PMI. I suspect that PMI2 is not much better as the primary
> >> reason for the difference is that mpirun sends blobs, while PMI
> >> requires that everything be encoded into strings and sent in little
> >> pieces.
> >>
> >> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> >> operation) much faster, and MPI_Init completes faster. Rest of the
> >> computation should be the same, so long compute apps will see the
> >> difference narrow considerably.
> >
> > Unfortunately it looks like I had an enthusiastic cleanup at some point
> > and so I cannot find the out files from those runs at the moment, but
> > I did find some comparisons from around that time.
> >
> > This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
> > run with mpirun and srun successively from inside the same Slurm job.
> >
> > mpirun namd2 macpf.conf
> > srun --mpi=pmi2 namd2 macpf.conf
> >
> > Firstly the mpirun output (grep'ing the interesting bits):
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19
> MB memory
> > WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
> >
> > Now the srun output:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75
> MB memory
> > Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75
> MB memory
> > WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB
> >
> >
> > The next two pairs are first launched using mpirun from 1.6.x and then
> with srun
> > from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the
> same inputs.
> >
> > First pair mpirun:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB
> memory
> > WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB
> >
> > First pair srun:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB
> memory
> > WallClock: 7476.643555  CPUTime: 7476.643555  Memory: 968.867188 MB
> >
> >
> > Second pair mpirun:
> >
> > Charm++> Running on MPI version: 2.1
> > Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB
> memory
> > Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB
> memory
> > Info: 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Interesting - how many nodes were involved? As I said, the bad scaling becomes 
more evident at a fairly high node count.

On May 7, 2014, at 12:07 AM, Christopher Samuel  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> Hiya Ralph,
> 
> On 07/05/14 14:49, Ralph Castain wrote:
> 
>> I should have looked closer to see the numbers you posted, Chris -
>> those include time for MPI wireup. So what you are seeing is that
>> mpirun is much more efficient at exchanging the MPI endpoint info
>> than PMI. I suspect that PMI2 is not much better, as the primary
>> reason for the difference is that mpirun sends blobs, while PMI
>> requires that everything be encoded into strings and sent in little
>> pieces.
>> 
>> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
>> operation) much faster, and MPI_Init completes faster. Rest of the
>> computation should be the same, so long-running compute apps will
>> see the difference narrow considerably.
> 
> Unfortunately it looks like I had an enthusiastic cleanup at some point
> and so I cannot find the out files from those runs at the moment, but
> I did find some comparisons from around that time.
> 
> This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
> run with mpirun and srun successively from inside the same Slurm job.
> 
> mpirun namd2 macpf.conf 
> srun --mpi=pmi2 namd2 macpf.conf 
> 
> Firstly the mpirun output (grep'ing the interesting bits):
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB 
> memory
> WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB
> 
> Now the srun output:
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB 
> memory
> Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB 
> memory
> WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB
> 
> 
> The next two pairs are first launched using mpirun from 1.6.x and then with 
> srun
> from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the same 
> inputs.
> 
> First pair mpirun:
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
> WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB
> 
> First pair srun:
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
> Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB 
> memory
> WallClock: 7476.643555  CPUTime: 7476.643555  Memory: 968.867188 MB
> 
> 
> Second pair mpirun:
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
> Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB 
> memory
> Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB 
> memory
> WallClock: 7842.831543  CPUTime: 7842.831543  Memory: 1004.050781 MB
> 
> Second pair srun:
> 
> Charm++> Running on MPI version: 2.1
> Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
> WallClock: 7522.677246  CPUTime: 7522.677246  Memory: 969.433594 MB
> 
> 
> So to me it looks like (for NAMD on our system at least) PMI2 does
> give better scalability.

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hiya Ralph,

On 07/05/14 14:49, Ralph Castain wrote:

> I should have looked closer to see the numbers you posted, Chris -
> those include time for MPI wireup. So what you are seeing is that
> mpirun is much more efficient at exchanging the MPI endpoint info
> than PMI. I suspect that PMI2 is not much better, as the primary
> reason for the difference is that mpirun sends blobs, while PMI
> requires that everything be encoded into strings and sent in little
> pieces.
> 
> Hence, mpirun can exchange the endpoint info (the dreaded "modex"
> operation) much faster, and MPI_Init completes faster. Rest of the
> computation should be the same, so long-running compute apps will
> see the difference narrow considerably.

Unfortunately it looks like I had an enthusiastic cleanup at some point
and so I cannot find the out files from those runs at the moment, but
I did find some comparisons from around that time.

This first pair are comparing running NAMD with OMPI 1.7.3a1r29103
run with mpirun and srun successively from inside the same Slurm job.

mpirun namd2 macpf.conf 
srun --mpi=pmi2 namd2 macpf.conf 

Firstly the mpirun output (grep'ing the interesting bits):

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0959179 s/step 0.555081 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0929002 s/step 0.537617 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0727373 s/step 0.420933 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0779532 s/step 0.451118 days/ns 1055.19 MB 
memory
Info: Benchmark time: 512 CPUs 0.0785246 s/step 0.454425 days/ns 1055.19 MB 
memory
WallClock: 1403.388550  CPUTime: 1403.388550  Memory: 1119.085938 MB

Now the srun output:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 512 CPUs 0.0906865 s/step 0.524806 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0874809 s/step 0.506255 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0746328 s/step 0.431903 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0726161 s/step 0.420232 days/ns 1036.75 MB 
memory
Info: Benchmark time: 512 CPUs 0.0710574 s/step 0.411212 days/ns 1036.75 MB 
memory
WallClock: 1230.784424  CPUTime: 1230.784424  Memory: 1100.648438 MB


The next two pairs are first launched using mpirun from 1.6.x and then with srun
from 1.7.3a1r29103.  Again each pair inside the same Slurm job with the same 
inputs.

First pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
WallClock: 8341.524414  CPUTime: 8341.524414  Memory: 975.015625 MB

First pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
WallClock: 7476.643555  CPUTime: 7476.643555  Memory: 968.867188 MB


Second pair mpirun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
WallClock: 7842.831543  CPUTime: 7842.831543  Memory: 1004.050781 MB

Second pair srun:

Charm++> Running on MPI version: 2.1
Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
WallClock: 7522.677246  CPUTime: 7522.677246  Memory: 969.433594 MB


So to me it looks like (for NAMD on our system at least) PMI2 does
give better scalability.

All the best!
Chris
- -- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with 

Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
I should have looked closer to see the numbers you posted, Chris - those 
include time for MPI wireup. So what you are seeing is that mpirun is much more 
efficient at exchanging the MPI endpoint info than PMI. I suspect that PMI2 is 
not much better, as the primary reason for the difference is that mpirun sends
blobs, while PMI requires that everything be encoded into strings and sent in 
little pieces.

Hence, mpirun can exchange the endpoint info (the dreaded "modex" operation) 
much faster, and MPI_Init completes faster. Rest of the computation should be 
the same, so long-running compute apps will see the difference narrow
considerably.
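
To make the "little pieces" part concrete, here is a rough sketch of what
publishing one rank's endpoint blob through the PMI-1 key-value space looks
like. This is a hypothetical illustration only, not OMPI source; it assumes
the standard <pmi.h> interface that Slurm's libpmi exposes, built with
something like "gcc pmi_sketch.c -lpmi":

/* Illustration: a binary endpoint blob has to be turned into printable
 * strings and pushed as several small KVS entries, one Put per chunk,
 * whereas mpirun can forward the raw bytes in a single message. */
#include <pmi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    int spawned, rank, size, name_max, val_max;

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_name_length_max(&name_max);
    PMI_KVS_Get_value_length_max(&val_max);

    char *kvsname = malloc(name_max);
    PMI_KVS_Get_my_name(kvsname, name_max);

    /* Stand-in for this rank's real endpoint info (LIDs, QPNs, URIs...). */
    unsigned char blob[256];
    memset(blob, rank & 0xff, sizeof(blob));

    /* Hex-encode: every byte becomes two printable characters. */
    char hex[2 * sizeof(blob) + 1];
    for (size_t i = 0; i < sizeof(blob); i++)
        sprintf(&hex[2 * i], "%02x", blob[i]);

    /* Chop the string into value-sized pieces: one PMI_KVS_Put per piece.
     * (OMPI's actual key names and encoding differ; this only shows the
     * shape of the traffic.) */
    char key[64];
    char *val = malloc(val_max);
    int chunk = val_max - 1, nput = 0;
    for (size_t off = 0; off < strlen(hex); off += chunk, nput++) {
        snprintf(key, sizeof(key), "ep-%d-%d", rank, nput);
        snprintf(val, val_max, "%s", hex + off);
        PMI_KVS_Put(kvsname, key, val);
    }
    PMI_KVS_Commit(kvsname);
    PMI_Barrier();   /* peers then pull size x nput string values back */

    if (rank == 0)
        printf("%d KVS string chunks published per rank, %d ranks\n",
               nput, size);

    PMI_Finalize();
    free(val);
    free(kvsname);
    return 0;
}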

HTH
Ralph

On May 6, 2014, at 9:45 PM, Ralph Castain  wrote:

> Ah, interesting - my comments were in respect to startup time (specifically, 
> MPI wireup)
> 
> On May 6, 2014, at 8:49 PM, Christopher Samuel  wrote:
> 
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA1
>> 
>> On 07/05/14 13:37, Moody, Adam T. wrote:
>> 
>>> Hi Chris,
>> 
>> Hi Adam,
>> 
>>> I'm interested in SLURM / OpenMPI startup numbers, but I haven't
>>> done this testing myself.  We're stuck with an older version of
>>> SLURM for various internal reasons, and I'm wondering whether it's
>>> worth the effort to back port the PMI2 support.  Can you share some
>>> of the differences in times at different scales?
>> 
>> We've not looked at startup times, I'm afraid; this was time to
>> solution. We noticed it with Slurm when we first started using it on
>> x86-64 for our NAMD tests (this is from a posting to the list last
>> year when I raised the issue and was told PMI2 would be the solution):
>> 
>>> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
>>> 
>>> Here are some timings as reported as the WallClock time by NAMD 
>>> itself (so not including startup/tear down overhead from Slurm).
>>> 
>>> srun:
>>> 
>>> run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773 
>>> run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959 
>>> run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799 
>>> run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918
>>> 
>>> Average of 692 seconds
>>> 
>>> mpirun:
>>> 
>>> run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035 
>>> run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333 
>>> run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693
>>> 
>>> Average of 563 seconds.
>>> 
>>> So that's about 23% slower.
>>> 
>>> Everything is identical (they're all symlinks to the same golden 
>>> master) *except* for the srun / mpirun which is modified by
>>> copying the batch script and substituting mpirun for srun.
>> 
>> 
>> 
>> - -- 
>> Christopher SamuelSenior Systems Administrator
>> VLSCI - Victorian Life Sciences Computation Initiative
>> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
>> http://www.vlsci.org.au/  http://twitter.com/vlsci
>> 
>> -BEGIN PGP SIGNATURE-
>> Version: GnuPG v1.4.14 (GNU/Linux)
>> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
>> 
>> iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw
>> 8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK
>> =pqH9
>> -END PGP SIGNATURE-
>> ___
>> devel mailing list
>> de...@open-mpi.org
>> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/devel/2014/05/14694.php
> 



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
Ah, interesting - my comments were in respect to startup time (specifically, 
MPI wireup)
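
If it helps, a quick way to separate the two on your system is a tiny timing
harness (a hypothetical sketch, not taken from OMPI or NAMD) that reports only
how long MPI_Init takes. Build it with something like "mpicc wireup_timer.c
-o wireup_timer -lrt" (the -lrt is likely needed on RHEL 6's glibc) and launch
it once with mpirun and once with srun --mpi=pmi2 inside the same job:

#include <mpi.h>
#include <stdio.h>
#include <time.h>

int main(int argc, char **argv)
{
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);      /* before wireup starts */

    MPI_Init(&argc, &argv);                   /* modex happens in here */
    clock_gettime(CLOCK_MONOTONIC, &t1);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    double init_s = (t1.tv_sec - t0.tv_sec) +
                    (t1.tv_nsec - t0.tv_nsec) / 1e9;
    double max_init;
    MPI_Reduce(&init_s, &max_init, 1, MPI_DOUBLE, MPI_MAX, 0,
               MPI_COMM_WORLD);
    if (rank == 0)
        printf("slowest MPI_Init across ranks: %.3f s\n", max_init);

    /* ... the long-running compute would go here ... */

    MPI_Finalize();
    return 0;
}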

On May 6, 2014, at 8:49 PM, Christopher Samuel  wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 07/05/14 13:37, Moody, Adam T. wrote:
> 
>> Hi Chris,
> 
> Hi Adam,
> 
>> I'm interested in SLURM / OpenMPI startup numbers, but I haven't
>> done this testing myself.  We're stuck with an older version of
>> SLURM for various internal reasons, and I'm wondering whether it's
>> worth the effort to back port the PMI2 support.  Can you share some
>> of the differences in times at different scales?
> 
> We've not looked at startup times, I'm afraid; this was time to
> solution. We noticed it with Slurm when we first started using it on
> x86-64 for our NAMD tests (this is from a posting to the list last
> year when I raised the issue and was told PMI2 would be the solution):
> 
>> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
>> 
>> Here are some timings as reported as the WallClock time by NAMD 
>> itself (so not including startup/tear down overhead from Slurm).
>> 
>> srun:
>> 
>> run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773 
>> run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959 
>> run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799 
>> run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918
>> 
>> Average of 692 seconds
>> 
>> mpirun:
>> 
>> run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035 
>> run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333 
>> run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693
>> 
>> Average of 563 seconds.
>> 
>> So that's about 23% slower.
>> 
>> Everything is identical (they're all symlinks to the same golden 
>> master) *except* for the srun / mpirun which is modified by
>> copying the batch script and substituting mpirun for srun.
> 
> 
> 
> - -- 
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw
> 8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK
> =pqH9
> -END PGP SIGNATURE-
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14694.php



Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Christopher Samuel
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 07/05/14 13:37, Moody, Adam T. wrote:

> Hi Chris,

Hi Adam,

> I'm interested in SLURM / OpenMPI startup numbers, but I haven't
> done this testing myself.  We're stuck with an older version of
> SLURM for various internal reasons, and I'm wondering whether it's
> worth the effort to back port the PMI2 support.  Can you share some
> of the differences in times at different scales?

We've not looked at startup times, I'm afraid; this was time to
solution. We noticed it with Slurm when we first started using it on
x86-64 for our NAMD tests (this is from a posting to the list last
year when I raised the issue and was told PMI2 would be the solution):

> Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.
> 
> Here are some timings as reported as the WallClock time by NAMD 
> itself (so not including startup/tear down overhead from Slurm).
> 
> srun:
> 
> run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773 
> run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959 
> run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799 
> run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918
> 
> Average of 692 seconds
> 
> mpirun:
> 
> run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035 
> run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333 
> run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693
> 
> Average of 563 seconds.
> 
> So that's about 23% slower.
> 
> Everything is identical (they're all symlinks to the same golden 
> master) *except* for the srun / mpirun which is modified by
> copying the batch script and substituting mpirun for srun.



- -- 
 Christopher SamuelSenior Systems Administrator
 VLSCI - Victorian Life Sciences Computation Initiative
 Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
 http://www.vlsci.org.au/  http://twitter.com/vlsci

-BEGIN PGP SIGNATURE-
Version: GnuPG v1.4.14 (GNU/Linux)
Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/

iEYEARECAAYFAlNprUUACgkQO2KABBYQAh9rLACfcZc4HR/u6G0bJejM3C/my7Nw
8b4AnRasOMvKZjpjpyKkbplc6/Iq9qBK
=pqH9
-END PGP SIGNATURE-


Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is specifically requested

2014-05-07 Thread Ralph Castain
FWIW: we see varying reports about the scalability of Slurm, especially at 
large cluster sizes. Last I saw/tested, there is a quadratic term that begins 
to dominate above 2k nodes. Others swear it is better. Guess I'd be 
cautious and definitely test things before investing in a move - I'm not 
convinced.


On May 6, 2014, at 8:37 PM, Moody, Adam T. <mood...@llnl.gov> wrote:

> Hi Chris,
> I'm interested in SLURM / OpenMPI startup numbers, but I haven't done this 
> testing myself.  We're stuck with an older version of SLURM for various 
> internal reasons, and I'm wondering whether it's worth the effort to back 
> port the PMI2 support.  Can you share some of the differences in times at 
> different scales?
> Thanks,
> -Adam
> 
> From: devel [devel-boun...@open-mpi.org] on behalf of Christopher Samuel 
> [sam...@unimelb.edu.au]
> Sent: Tuesday, May 06, 2014 8:32 PM
> To: de...@open-mpi.org
> Subject: Re: [OMPI devel] RFC: Force Slurm to use PMI-1 unless PMI-2 is 
> specifically requested
> 
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA1
> 
> On 07/05/14 12:53, Ralph Castain wrote:
> 
>> We have been seeing a lot of problems with the Slurm PMI-2 support
>> (not in OMPI - it's the code in Slurm that is having problems). At
>> this time, I'm unaware of any advantage in using PMI-2 over PMI-1
>> in Slurm - the scaling is equally poor, and PMI-2 does not support
>> any additional functionality.
>> 
>> I know that Cray PMI-2 has a definite advantage, so I'm proposing
>> that we turn PMI-2 "off" when under Slurm unless the user
>> specifically requests we use it.
> 
> Our local testing has shown that PMI-2 in 1.7.x gives a massive
> improvement in scaling when starting jobs with srun compared to using
> srun with OMPI 1.6.x, and now that OMPI 1.8.x is out we're planning
> on moving to using PMI2 with OMPI and srun.
> 
> Using mpirun gives good performance with OMPI 1.6.x, but Slurm then
> gets all its memory stats wrong, and if you run with CR_Core_Memory
> in Slurm there is a very high risk that your job will get killed
> incorrectly.
> 
> All the best,
> Chris
> - --
> Christopher SamuelSenior Systems Administrator
> VLSCI - Victorian Life Sciences Computation Initiative
> Email: sam...@unimelb.edu.au Phone: +61 (0)3 903 55545
> http://www.vlsci.org.au/  http://twitter.com/vlsci
> 
> -BEGIN PGP SIGNATURE-
> Version: GnuPG v1.4.14 (GNU/Linux)
> Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/
> 
> iEYEARECAAYFAlNpqUwACgkQO2KABBYQAh/igwCfQSB/v3tI37Rq4z5z/0xT/BYU
> 6ToAn3Qt6tOt46LQD25eHhlx+3z/sjnQ
> =LEHf
> -END PGP SIGNATURE-
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14691.php
> ___
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2014/05/14692.php