Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-06 Thread Christopher Samuel

On 06/09/13 14:14, Christopher Samuel wrote:

> However, modifying the test program confirms that the variable is getting
> propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
> snapshot. :-(

Investigating further by setting:

export OMPI_MCA_orte_report_bindings=1
export SLURM_CPU_BIND=core
export SLURM_CPU_BIND_VERBOSE=verbose

reveals that only OMPI 1.6.5 with mpirun reports bindings being set
(see below). We cannot understand why Slurm doesn't *appear* to be
setting bindings, as we have the correct settings according to the
documentation.

Whilst that may explain the difference between 1.6.5 with mpirun and
srun, it doesn't explain why the 1.7 snapshot is so much better, as
you'd expect both to be hurt in the same way.


==OPENMPI 1.6.5==
==mpirun==
[barcoo003:03633] System has detected external process binding to cores 0001
[barcoo003:03633] MCW rank 0 bound to socket 0[core 0]: [B]
[barcoo004:04504] MCW rank 1 bound to socket 0[core 0]: [B]
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar 2
==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 1 universe size 2 universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 1 universe size 2 universe envar NULL
=
==OPENMPI 1.7.3==
DANGER: YOU ARE LOADING A TEST VERSION OF OPENMPI. THIS MAY BE BAD.
==mpirun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar 2
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar 2
==srun==
Hello, World, I am 0 of 2 on host barcoo003 from app number 0 universe size 2 universe envar NULL
Hello, World, I am 1 of 2 on host barcoo004 from app number 0 universe size 2 universe envar NULL
=
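
For completeness, bindings can also be checked independently of what either
runtime reports by having each rank print its own affinity mask; a minimal
sketch (a hypothetical helper, not the test program used above):

/* affinity_check.c - each MPI rank prints the cores it is bound to.
 * A hypothetical cross-check, independent of what mpirun/srun report.
 * Build with: mpicc -std=c99 affinity_check.c -o affinity_check */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, c;
    char host[64];
    cpu_set_t mask;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    gethostname(host, sizeof(host));

    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        printf("rank %d on %s bound to cores:", rank, host);
        for (c = 0; c < CPU_SETSIZE; c++)
            if (CPU_ISSET(c, &mask))
                printf(" %d", c);
        printf("\n");
    }

    MPI_Finalize();
    return 0;
}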





Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-06 Thread Christopher Samuel

On 06/09/13 00:23, Hjelm, Nathan T wrote:

> I assume that process binding is enabled for both mpirun and srun?
> If not, that could account for a difference between the runtimes.

You raise an interesting point; we have been doing that with:

[samuel@barcoo ~]$ module show openmpi 2>&1 | grep binding
setenv   OMPI_MCA_orte_process_binding core

However, modifying the test program confirms that the variable is getting
propagated as expected with both mpirun and srun for 1.6.5 and the 1.7
snapshot. :-(
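
For the record, the modification amounted to something like the following
sketch (not the exact code):

/* Sketch of the propagation check: each rank reports whether the MCA
 * variable set by the module file actually reached its environment. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    char *val;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    val = getenv("OMPI_MCA_orte_process_binding");
    printf("rank %d: OMPI_MCA_orte_process_binding=%s\n",
           rank, val ? val : "NULL");
    MPI_Finalize();
    return 0;
}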

cheers,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-05 Thread Hjelm, Nathan T
I assume that process binding is enabled for both mpirun and srun? If not,
that could account for a difference between the runtimes.

-Nathan

From: devel [devel-boun...@open-mpi.org] on behalf of Ralph Castain 
[r...@open-mpi.org]
Sent: Thursday, September 05, 2013 8:19 AM
To: Open MPI Developers
Subject: Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20%
slower than with mpirun

No, nothing significant there. Afraid I've exhausted my thoughts on why the 
difference might exist.

Anyone else care to chime in?

On Sep 4, 2013, at 9:34 PM, Christopher Samuel <sam...@unimelb.edu.au> wrote:

> Hi Ralph,
>
> On 05/09/13 12:50, Ralph Castain wrote:
>
>> Jeff and I were looking at a similar issue today and suddenly
>> realized that the mappings were different - i.e., what ranks are
>> on what nodes differs depending on how you launch. You might want
>> to check if that's the issue here as well. Just launch the
>> attached program using mpirun vs srun and check to see if the maps
>> are the same or not.
>
> Very interesting: the rank-to-node mappings are identical in all
> cases (mpirun and srun for 1.6.5 and my test 1.7.3 snapshot), but what
> is different is as follows.
>
>
> For the 1.6.5 build I see mpirun report:
>
> number 0 universe size 64 universe envar 64
>
> whereas srun report:
>
> number 1 universe size 64 universe envar NULL
>
>
>
> For the 1.7.3 snapshot both report "number 0" so the only difference
> there is that mpirun has:
>
> envar 64
>
> whereas srun has:
>
> envar NULL
>
>
> Are these differences significant?
>
> I'm intrigued that the problem child (srun 1.6.5) is the only one
> where number is 1.
>
> All the best,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-05 Thread Ralph Castain
No, nothing significant there. Afraid I've exhausted my thoughts on why the 
difference might exist.

Anyone else care to chime in?

On Sep 4, 2013, at 9:34 PM, Christopher Samuel  wrote:

> Hi Ralph,
> 
> On 05/09/13 12:50, Ralph Castain wrote:
> 
>> Jeff and I were looking at a similar issue today and suddenly 
>> realized that the mappings were different - i.e., what ranks are
>> on what nodes differs depending on how you launch. You might want
>> to check if that's the issue here as well. Just launch the
>> attached program using mpirun vs srun and check to see if the maps
>> are the same or not.
> 
> Very interesting: the rank-to-node mappings are identical in all
> cases (mpirun and srun for 1.6.5 and my test 1.7.3 snapshot), but what
> is different is as follows.
> 
> 
> For the 1.6.5 build I see mpirun report:
> 
> number 0 universe size 64 universe envar 64
> 
> whereas srun report:
> 
> number 1 universe size 64 universe envar NULL
> 
> 
> 
> For the 1.7.3 snapshot both report "number 0" so the only difference
> there is that mpirun has:
> 
> envar 64
> 
> whereas srun has:
> 
> envar NULL
> 
> 
> Are these differences significant?
> 
> I'm intrigued that the problem child (srun 1.6.5) is the only one
> where number is 1.
> 
> All the best,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-05 Thread Christopher Samuel

Hi Ralph,

On 05/09/13 12:50, Ralph Castain wrote:

> Jeff and I were looking at a similar issue today and suddenly 
> realized that the mappings were different - i.e., what ranks are
> on what nodes differs depending on how you launch. You might want
> to check if that's the issue here as well. Just launch the
> attached program using mpirun vs srun and check to see if the maps
> are the same or not.

Very interesting: the rank-to-node mappings are identical in all
cases (mpirun and srun for 1.6.5 and my test 1.7.3 snapshot), but what
is different is as follows.


For the 1.6.5 build I see mpirun report:

number 0 universe size 64 universe envar 64

whereas srun report:

number 1 universe size 64 universe envar NULL



For the 1.7.3 snapshot both report "number 0" so the only difference
there is that mpirun has:

envar 64

whereas srun has:

envar NULL


Are these differences significant?

I'm intrigued that the problem child (srun 1.6.5) is the only one
where number is 1.

All the best,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread Ralph Castain
Jeff and I were looking at a similar issue today and suddenly realized that the 
mappings were different - i.e., what ranks are on what nodes differs depending 
on how you launch. You might want to check if that's the issue here as well. 
Just launch the attached program using mpirun vs srun and check to see if the 
maps are the same or not.

Ralph


hello_nodename.c
Description: Binary data
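
(The attachment isn't preserved in this archive. A hypothetical
reconstruction, consistent with the "Hello, World ... universe envar" output
quoted in the replies, might look like the following; reading the envar from
OMPI_UNIVERSE_SIZE is an assumption:)

/* hello_nodename.c - hypothetical reconstruction, not the original
 * attachment. Prints rank, size, host, MPI_APPNUM, MPI_UNIVERSE_SIZE
 * and the OMPI_UNIVERSE_SIZE environment variable. */
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len, flag;
    int *appnum = NULL, *usize = NULL;
    char name[MPI_MAX_PROCESSOR_NAME];
    char *envar;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(name, &len);
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_APPNUM, &appnum, &flag);
    MPI_Comm_get_attr(MPI_COMM_WORLD, MPI_UNIVERSE_SIZE, &usize, &flag);
    envar = getenv("OMPI_UNIVERSE_SIZE");

    printf("Hello, World, I am %d of %d on host %s from app number %d "
           "universe size %d universe envar %s\n",
           rank, size, name, appnum ? *appnum : -1,
           usize ? *usize : -1, envar ? envar : "NULL");

    MPI_Finalize();
    return 0;
}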


On Sep 4, 2013, at 7:15 PM, Christopher Samuel  wrote:

> On 04/09/13 18:33, George Bosilca wrote:
> 
>> You can confirm that the slowdown happens during the MPI
>> initialization stages by profiling the application (especially the
>> MPI_Init call).
> 
> NAMD helpfully prints benchmark and timing numbers during the initial
> part of the simulation, so here's what they say. For both seconds
> per step and days per nanosecond of simulation, lower is better.
> 
> I've included the benchmark numbers (every 100 steps or so from the
> start) and the final timing number after 25000 steps. It looks to me
> (as a sysadmin, not an MD person) as though the final timing number
> includes both CPU time and wallclock time in seconds per step.
> 
> 64 cores over 10 nodes:
> 
> OMPI 1.7.3a1r29103 mpirun
> 
> Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
> Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory
> 
> TIMING: 25000  CPU: 8247.2, 0.330157/step  Wall: 8247.2, 0.330157/step, 0.0229276 hours remaining, 921.894531 MB of memory in use.
> 
> OMPI 1.7.3a1r29103 srun
> 
> Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
> Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
> Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
> Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
> Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory
> 
> TIMING: 25000  CPU: 7390.15, 0.296/step  Wall: 7390.15, 0.296/step, 0.020 hours remaining, 915.746094 MB of memory in use.
> 
> 
> 64 cores over 18 nodes:
> 
> OMPI 1.6.5 mpirun
> 
> Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
> Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
> Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
> Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
> Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory
> 
> TIMING: 25000  CPU: 7754.17, 0.312071/step  Wall: 7754.17, 0.312071/step, 0.0216716 hours remaining, 950.929688 MB of memory in use.
> 
> OMPI 1.7.3a1r29103 srun
> 
> Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
> Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory
> 
> TIMING: 25000  CPU: 7420.91, 0.296029/step  Wall: 7420.91, 0.296029/step, 0.0205575 hours remaining, 916.312500 MB of memory in use.
> 
> 
> Hope this is useful!
> 
> All the best,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread Christopher Samuel

On 04/09/13 18:33, George Bosilca wrote:

> You can confirm that the slowdown happens during the MPI
> initialization stages by profiling the application (especially the
> MPI_Init call).

NAMD helpfully prints benchmark and timing numbers during the initial
part of the simulation, so here's what they say. For both seconds
per step and days per nanosecond of simulation, lower is better.

I've included the benchmark numbers (every 100 steps or so from the
start) and the final timing number after 25000 steps. It looks to me
(as a sysadmin, not an MD person) as though the final timing number
includes both CPU time and wallclock time in seconds per step.

64 cores over 10 nodes:

OMPI 1.7.3a1r29103 mpirun

Info: Benchmark time: 64 CPUs 0.410424 s/step 2.37514 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.392106 s/step 2.26913 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313136 s/step 1.81213 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.316792 s/step 1.83329 days/ns 909.57 MB memory
Info: Benchmark time: 64 CPUs 0.313867 s/step 1.81636 days/ns 909.57 MB memory

TIMING: 25000  CPU: 8247.2, 0.330157/step  Wall: 8247.2, 0.330157/step, 0.0229276 hours remaining, 921.894531 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.341967 s/step 1.97897 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.339644 s/step 1.96553 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.284424 s/step 1.64597 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.28115 s/step 1.62702 days/ns 903.883 MB memory
Info: Benchmark time: 64 CPUs 0.279536 s/step 1.61769 days/ns 903.883 MB memory

TIMING: 25000  CPU: 7390.15, 0.296/step  Wall: 7390.15, 0.296/step, 0.020 hours remaining, 915.746094 MB of memory in use.


64 cores over 18 nodes:

OMPI 1.6.5 mpirun

Info: Benchmark time: 64 CPUs 0.366327 s/step 2.11995 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.359805 s/step 2.0822 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292342 s/step 1.69179 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.293499 s/step 1.69849 days/ns 939.527 MB memory
Info: Benchmark time: 64 CPUs 0.292355 s/step 1.69187 days/ns 939.527 MB memory

TIMING: 25000  CPU: 7754.17, 0.312071/step  Wall: 7754.17, 0.312071/step, 0.0216716 hours remaining, 950.929688 MB of memory in use.

OMPI 1.7.3a1r29103 srun

Info: Benchmark time: 64 CPUs 0.347864 s/step 2.0131 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.346367 s/step 2.00444 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.29007 s/step 1.67865 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.279447 s/step 1.61717 days/ns 904.91 MB memory
Info: Benchmark time: 64 CPUs 0.280824 s/step 1.62514 days/ns 904.91 MB memory

TIMING: 25000  CPU: 7420.91, 0.296029/step  Wall: 7420.91, 0.296029/step, 0.0205575 hours remaining, 916.312500 MB of memory in use.


Hope this is useful!

All the best,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread Jeff Squyres (jsquyres)
On Sep 4, 2013, at 4:33 AM, George Bosilca  wrote:

> You can confirm that the slowdown happens during the MPI initialization stages
> by profiling the application (especially the MPI_Init call).

You can also try just launching "MPI hello world" (i.e., examples/hello_c.c).  
It just calls MPI_INIT / MPI_FINALIZE.
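
For reference, the whole test is on the order of this sketch (along the lines
of examples/hello_c.c, not a verbatim copy):

/* hello-world sketch: nothing but startup and teardown, so any
 * launch-time difference dominates the run time. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello, world, I am %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}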

Additionally, you might want to try launching the ring program, too 
(examples/ring_c.c).  That program sends a small message around in a ring, 
which forces some MPI communication to occur, and therefore does at least some 
level of setup in the BTLs, etc. (remember: most BTLs are lazy-connect, so they 
don't actually do anything until the first send.  So a simple "ring" program 
sets up *some* BTL connections, but not nearly all of them).
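
A one-pass ring looks roughly like this (a sketch in the spirit of
examples/ring_c.c, which actually circulates the token several times; run it
with at least two ranks):

/* ring sketch: rank 0 starts a token and each rank forwards it to the
 * next, forcing at least the neighbour BTL connections to be set up. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, token, next, prev;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    next = (rank + 1) % size;
    prev = (rank + size - 1) % size;

    if (rank == 0) {
        token = 42;
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    } else {
        MPI_Recv(&token, 1, MPI_INT, prev, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        MPI_Send(&token, 1, MPI_INT, next, 0, MPI_COMM_WORLD);
    }
    printf("rank %d of %d passed the token\n", rank, size);

    MPI_Finalize();
    return 0;
}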

-- 
Jeff Squyres
jsquy...@cisco.com



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread Ralph Castain
This is 1.7.3 - there is no comm thread in ORTE in that version.

On Sep 4, 2013, at 1:33 AM, George Bosilca  wrote:

> You can confirm that the slowdown happens during the MPI initialization stages
> by profiling the application (especially the MPI_Init call).
> 
> Another possible cause of slowdown might be the communication thread in
> ORTE. If it remains active outside the initialization it will definitely
> disturb the application by taking away critical resources.
> 
>  George.
> 
> On Sep 4, 2013, at 05:59 , Christopher Samuel  wrote:
> 
>> On 04/09/13 11:29, Ralph Castain wrote:
>> 
>>> Your code is obviously doing something much more than just
>>> launching and wiring up, so it is difficult to assess the
>>> difference in speed between 1.6.5 and 1.7.3 - my guess is that it
>>> has to do with changes in the MPI transport layer and nothing to do
>>> with whether PMI is used.
>> 
>> I'm testing with what would be our most used application in aggregate
>> across our systems, the NAMD molecular dynamics code from here:
>> 
>> http://www.ks.uiuc.edu/Research/namd/
>> 
>> so yes, you're quite right, it's doing a lot more than that and has a
>> reputation for being a *very* chatty MPI code.
>> 
>> For comparison, whilst users see GROMACS also suffer with srun under
>> 1.6.5, they don't see anything like the slowdown that NAMD gets.
>> 
>> All the best,
>> Chris
> 



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread George Bosilca
You can confirm that the slowdown happens during the MPI initialization stages
by profiling the application (especially the MPI_Init call).
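
For timing just MPI_Init without a full profiler, a PMPI shim is a cheap
option; a sketch (assuming you link it ahead of the application, or build it
as a shared object and LD_PRELOAD it):

/* PMPI shim that times only MPI_Init. MPI_Wtime isn't usable before
 * init, so fall back on gettimeofday. */
#include <stdio.h>
#include <sys/time.h>
#include <mpi.h>

int MPI_Init(int *argc, char ***argv)
{
    struct timeval t0, t1;
    int rc, rank;

    gettimeofday(&t0, NULL);
    rc = PMPI_Init(argc, argv);
    gettimeofday(&t1, NULL);

    PMPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("rank %d: MPI_Init took %.3f s\n", rank,
           (t1.tv_sec - t0.tv_sec) + (t1.tv_usec - t0.tv_usec) / 1e6);
    return rc;
}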

Another possible cause of slowdown might be the communication thread in
ORTE. If it remains active outside the initialization it will definitely
disturb the application by taking away critical resources.

  George.

On Sep 4, 2013, at 05:59 , Christopher Samuel  wrote:

> On 04/09/13 11:29, Ralph Castain wrote:
> 
>> Your code is obviously doing something much more than just
>> launching and wiring up, so it is difficult to assess the
>> difference in speed between 1.6.5 and 1.7.3 - my guess is that it
>> has to do with changes in the MPI transport layer and nothing to do
>> with whether PMI is used.
> 
> I'm testing with what would be our most used application in aggregate
> across our systems, the NAMD molecular dynamics code from here:
> 
> http://www.ks.uiuc.edu/Research/namd/
> 
> so yes, you're quite right, it's doing a lot more than that and has a
> reputation for being a *very* chatty MPI code.
> 
> For comparison, whilst users see GROMACS also suffer with srun under
> 1.6.5, they don't see anything like the slowdown that NAMD gets.
> 
> All the best,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-04 Thread Christopher Samuel

On 04/09/13 11:29, Ralph Castain wrote:

> Your code is obviously doing something much more than just
> launching and wiring up, so it is difficult to assess the
> difference in speed between 1.6.5 and 1.7.3 - my guess is that it
> has to do with changes in the MPI transport layer and nothing to do
> with whether PMI is used.

I'm testing with what would be our most used application in aggregate
across our systems, the NAMD molecular dynamics code from here:

http://www.ks.uiuc.edu/Research/namd/

so yes, you're quite right, it's doing a lot more than that and has a
reputation for being a *very* chatty MPI code.

For comparison, whilst users see GROMACS also suffer with srun under
1.6.5, they don't see anything like the slowdown that NAMD gets.

All the best,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-03 Thread Ralph Castain
Your code is obviously doing something much more than just launching and wiring
up, so it is difficult to assess the difference in speed between 1.6.5 and
1.7.3 - my guess is that it has to do with changes in the MPI transport layer
and nothing to do with whether PMI is used.

Likewise, I can't imagine any differences in wireup method accounting for the
500-second difference in execution time between the two versions when using the
same launch method. I launch more than 10 nodes in far less time than that,
so again I expect this has to do with something in the MPI layer.

The real question is why you see so much difference between launching via
mpirun vs srun. Like I said, the launch and wireup times at such small scales
are negligible, so somehow you are winding up selecting different MPI transport
options. You can test this by just running "hello world" instead - I'll bet the
mpirun vs srun time differences are a second or two at most.

Perhaps Jeff or someone else can suggest some debug flags you could use to 
understand these differences?



On Sep 3, 2013, at 6:13 PM, Christopher Samuel  wrote:

> On 03/09/13 10:56, Ralph Castain wrote:
> 
>> Yeah - --with-pmi=
> 
> Actually I found that just --with-pmi=/usr/local/slurm/latest worked. :-)
> 
> I've got some initial numbers for 64 cores. As I mentioned, the system
> I found this on initially is so busy at the moment that I won't be able to
> run anything bigger for a while, so I'm going to move my testing to
> another system which is a bit quieter, but slower (it's Nehalem vs
> SandyBridge).
> 
> All the below tests are with the same NAMD 2.9 binary and within the
> same Slurm job, so it runs on the same cores each time. It's nice to
> find that C code at least seems to be backwards compatible!
> 
> 64 cores over 18 nodes:
> 
> Open-MPI 1.6.5 with mpirun - 7842 seconds
> Open-MPI 1.7.3a1r29103 with srun - 7522 seconds
> 
> so that's about a 4% speedup.
> 
> 64 cores over 10 nodes:
> 
> Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
> Open-MPI 1.7.3a1r29103 with srun - 7476 seconds
> 
> So that's about 11% faster, and the mpirun speed has decreased, though
> of course that build uses PMI, so perhaps that's the cause?
> 
> cheers,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-03 Thread Christopher Samuel

On 03/09/13 10:56, Ralph Castain wrote:

> Yeah - --with-pmi=

Actually I found that just --with-pmi=/usr/local/slurm/latest worked. :-)

I've got some initial numbers for 64 cores. As I mentioned, the system
I found this on initially is so busy at the moment that I won't be able to
run anything bigger for a while, so I'm going to move my testing to
another system which is a bit quieter, but slower (it's Nehalem vs
SandyBridge).

All the below tests are with the same NAMD 2.9 binary and within the
same Slurm job, so it runs on the same cores each time. It's nice to
find that C code at least seems to be backwards compatible!

64 cores over 18 nodes:

Open-MPI 1.6.5 with mpirun - 7842 seconds
Open-MPI 1.7.3a1r29103 with srun - 7522 seconds

so that's about a 4% speedup.

64 cores over 10 nodes:

Open-MPI 1.7.3a1r29103 with mpirun - 8341 seconds
Open-MPI 1.7.3a1r29103 with srun - 7476 seconds

So that's about 11% faster, and the mpirun speed has decreased, though
of course that build uses PMI, so perhaps that's the cause?

cheers,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-02 Thread Ralph Castain
Yeah - --with-pmi=


On Sep 2, 2013, at 5:27 PM, Christopher Samuel  wrote:

> On 31/08/13 02:42, Ralph Castain wrote:
> 
>> We did some work on the OMPI side and removed the O(N) calls to 
>> "get", so it should behave better now. If you get the chance,
>> please try the 1.7.3 nightly tarball. We hope to officially release
>> it soon.
> 
> Stupid question, but never having played with PMI before, is it just
> a case of appending the --with-pmi option to our current configure?
> 
> thanks,
> Chris



Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-02 Thread Christopher Samuel

On 31/08/13 02:42, Ralph Castain wrote:

> We did some work on the OMPI side and removed the O(N) calls to 
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Stupid question, but never having played with PMI before, is it just
a case of appending the --with-pmi option to our current configure?

thanks,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-09-02 Thread Christopher Samuel

On 31/08/13 02:42, Ralph Castain wrote:

> Hi Chris et al

Hiya,

> We did some work on the OMPI side and removed the O(N) calls to 
> "get", so it should behave better now. If you get the chance,
> please try the 1.7.3 nightly tarball. We hope to officially release
> it soon.

Thanks so much, I'll get our folks to rebuild a test version of NAMD
against 1.7.3a1r29103, which I built this afternoon.

It might be some time until I can get a test job of a suitable size to
run, though; it looks like our systems are flat out!

All the best,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-24 Thread Christopher Samuel

On 24/07/13 09:42, Ralph Castain wrote:

> Not to the 1.6 series, but it is in the about-to-be-released 1.7.3,
> and will be there from that point onwards.

Oh dear, I cannot delay this machine any longer to change to 1.7.x. :-(

> Still waiting to see if it resolves the difference.

When I've got the current rush out of the way I'll try a private build
of 1.7 and see how that goes with NAMD.

cheers!
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Ralph Castain
Not to the 1.6 series, but it is in the about-to-be-released 1.7.3, and will be
there from that point onwards. Still waiting to see if it resolves the
difference.


On Jul 23, 2013, at 4:28 PM, Christopher Samuel  wrote:

> On 23/07/13 19:34, Joshua Ladd wrote:
> 
>> Hi, Chris
> 
> Hi Joshua,
> 
> I've quoted you in full as I don't think your message made it through
> to the slurm-dev list (at least I've not received it from there yet).
> 
>> Funny you should mention this now. We identified and diagnosed the
>> issue some time ago as a combination of SLURM's PMI1
>> implementation and some of, what I'll call, OMPI's topology
>> requirements (probably not the right word). Here's what is
>> happening, in a nutshell, when you launch with srun:
>> 
>> 1. Each process pushes its endpoint data up to the PMI "cloud" via
>> PMI put (I think it's about five or six puts, bottom line, O(1).)
>> 2. Then executes a PMI commit and PMI barrier to ensure all other
>> processes have finished committing their data to the "cloud".
>> 3. Subsequent to this, each process executes O(N) (N is the number of
>> procs in the job) PMI gets in order to get all of the endpoint
>> data for every process regardless of whether or not the process
>> communicates with that endpoint.
>> 
>> "We" (MLNX et al.) undertook an in-depth scaling study of this and
>> identified several poorly scaling pieces with the worst offenders
>> being:
>> 
>> 1. PMI Barrier scales worse than linear.
>> 2. At scale, the PMI get phase starts to look quadratic.
>> 
>> The proposed solution that "we" (OMPI + SLURM) have come up with is
>> to modify OMPI to support PMI2 and to use SLURM 2.6 which has
>> support for PMI2 and is (allegedly) much more scalable than PMI1.
>> Several folks in the combined communities are working hard, as we
>> speak, trying to get this functional to see if it indeed makes a
>> difference. Stay tuned, Chris. Hopefully we will have some data by
>> the end of the week.
> 
> Wonderful, great to know that what we're seeing is actually real and
> not just pilot error on our part! We're happy enough to tell users
> to keep on using mpirun as they will be used to from our other Intel
> systems and to only use srun if the code requires it (one or two
> commercial apps that use Intel MPI).
> 
> Can I ask: if the PMI2 ideas work out, is that likely to get backported
> to OMPI 1.6.x?
> 
> All the best,
> Chris




Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Christopher Samuel

On 23/07/13 19:34, Joshua Ladd wrote:

> Hi, Chris

Hi Joshua,

I've quoted you in full as I don't think your message made it through
to the slurm-dev list (at least I've not received it from there yet).

> Funny you should mention this now. We identified and diagnosed the
> issue some time ago as a combination of SLURM's PMI1
> implementation and some of, what I'll call, OMPI's topology
> requirements (probably not the right word). Here's what is
> happening, in a nutshell, when you launch with srun:
> 
> 1. Each process pushes its endpoint data up to the PMI "cloud" via
> PMI put (I think it's about five or six puts, bottom line, O(1).)
> 2. Then executes a PMI commit and PMI barrier to ensure all other
> processes have finished committing their data to the "cloud".
> 3. Subsequent to this, each process executes O(N) (N is the number of
> procs in the job) PMI gets in order to get all of the endpoint
> data for every process regardless of whether or not the process
> communicates with that endpoint.
> 
> "We" (MLNX et al.) undertook an in-depth scaling study of this and
> identified several poorly scaling pieces with the worst offenders
> being:
> 
> 1. PMI Barrier scales worse than linear.
> 2. At scale, the PMI get phase starts to look quadratic.
> 
> The proposed solution that "we" (OMPI + SLURM) have come up with is
> to modify OMPI to support PMI2 and to use SLURM 2.6 which has
> support for PMI2 and is (allegedly) much more scalable than PMI1.
> Several folks in the combined communities are working hard, as we
> speak, trying to get this functional to see if it indeed makes a
> difference. Stay tuned, Chris. Hopefully we will have some data by
> the end of the week.

Wonderful, great to know that what we're seeing is actually real and
not just pilot error on our part! We're happy enough to tell users
to keep on using mpirun as they will be used to from our other Intel
systems and to only use srun if the code requires it (one or two
commercial apps that use Intel MPI).

Can I ask: if the PMI2 ideas work out, is that likely to get backported
to OMPI 1.6.x?

All the best,
Chris


Re: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower than with mpirun

2013-07-23 Thread Joshua Ladd
Hi, Chris

Funny you should mention this now. We identified and diagnosed the issue some
time ago as a combination of SLURM's PMI1 implementation and some of, what I'll
call, OMPI's topology requirements (probably not the right word). Here's what
is happening, in a nutshell, when you launch with srun:

1. Each process pushes its endpoint data up to the PMI "cloud" via PMI put (I
think it's about five or six puts, bottom line, O(1).)
2. Then executes a PMI commit and PMI barrier to ensure all other processes 
have finished committing their data to the "cloud".
3. Subsequent to this, each process executes O(N) (N is the number of procs in
the job) PMI gets in order to get all of the endpoint data for every process,
regardless of whether or not the process communicates with that endpoint.

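Schematically, steps 1-3 map onto the PMI1 API roughly as follows (an
illustrative sketch, not OMPI's actual code; the key names are invented):

/* Sketch of the PMI1 exchange described above. */
#include <stdio.h>
#include <pmi.h>

void exchange_endpoints(void)
{
    int spawned, rank, size, i;
    char kvs[256], key[64], val[256];

    PMI_Init(&spawned);
    PMI_Get_rank(&rank);
    PMI_Get_size(&size);
    PMI_KVS_Get_my_name(kvs, sizeof(kvs));

    /* Step 1: O(1) puts of this process's own endpoint data. */
    snprintf(key, sizeof(key), "ep-%d", rank);
    PMI_KVS_Put(kvs, key, "my-endpoint-data");

    /* Step 2: commit, then barrier, so everyone's data is visible. */
    PMI_KVS_Commit(kvs);
    PMI_Barrier();

    /* Step 3: O(N) gets per process - every rank fetches every
     * endpoint whether or not it will ever talk to it, O(N^2) in
     * aggregate across the job. */
    for (i = 0; i < size; i++) {
        snprintf(key, sizeof(key), "ep-%d", i);
        PMI_KVS_Get(kvs, key, val, sizeof(val));
    }
}
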
"We" (MLNX et al.) undertook an in-depth scaling study of this and identified 
several poorly scaling pieces with the worst offenders being:

1. PMI Barrier scales worse than linear.
2. At scale, the PMI get phase starts to look quadratic.   

The proposed solution that "we" (OMPI + SLURM) have come up with is to modify 
OMPI to support PMI2 and to use SLURM 2.6 which has support for PMI2 and is 
(allegedly) much more scalable than PMI1. Several folks in the combined 
communities are working hard, as we speak, trying to get this functional to see 
if it indeed makes a difference. Stay tuned, Chris. Hopefully we will have some 
data by the end of the week.  

Best regards,

Josh


Joshua S. Ladd, PhD
HPC Algorithms Engineer
Mellanox Technologies 

Email: josh...@mellanox.com
Cell: +1 (865) 258 - 8898





-Original Message-
From: devel-boun...@open-mpi.org [mailto:devel-boun...@open-mpi.org] On Behalf 
Of Christopher Samuel
Sent: Tuesday, July 23, 2013 3:06 AM
To: slurm-dev; Open MPI Developers
Subject: [OMPI devel] Open-MPI build of NAMD launched from srun over 20% slower
than with mpirun


Hi there slurm-dev and OMPI devel lists,

Bringing up a new IBM SandyBridge cluster, I'm running a NAMD test case and
noticed that if I run it with srun rather than mpirun it goes over 20% slower.
These are all launched from an sbatch script too.

Slurm 2.6.0, RHEL 6.4 (latest kernel), FDR IB.

Here are some timings as reported by NAMD itself as the WallClock time (so not
including startup/teardown overhead from Slurm).

srun:

run1/slurm-93744.out:WallClock: 695.079773  CPUTime: 695.079773
run4/slurm-94011.out:WallClock: 723.907959  CPUTime: 723.907959
run5/slurm-94013.out:WallClock: 726.156799  CPUTime: 726.156799
run6/slurm-94017.out:WallClock: 724.828918  CPUTime: 724.828918

Average of 717 seconds

mpirun:

run2/slurm-93746.out:WallClock: 559.311035  CPUTime: 559.311035
run3/slurm-93910.out:WallClock: 544.116333  CPUTime: 544.116333
run7/slurm-94019.out:WallClock: 586.072693  CPUTime: 586.072693

Average of 563 seconds.

So that's about 27% slower.

Everything is identical (they're all symlinks to the same golden master)
*except* for the srun / mpirun choice, which is changed by copying the batch
script and substituting mpirun for srun.

When they are running I can see that jobs launched with srun are direct
children of slurmstepd, whereas when started with mpirun they are children of
Open-MPI's orted (or mpirun on the launch node), which itself is a child of
slurmstepd.

Has anyone else seen anything like this, or got any ideas?

cheers,
Chris