Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Markus Daene
Rolf,

I think it is not a good idea to increase the default value to 2G.  You
have to keep in mind that there are not many people who have a
machine with 128 or more cores on a single node. Most people
will have nodes with 2, 4, or maybe 8 cores, so it is not necessary
to set this parameter to such a high value. It may allocate all
of this memory per node, and if you have only 4 or 8G per node that will
be imbalanced. For my 8-core nodes I have even decreased the sm_max_size
to 32M and I had no problems with that. As far as I know (if not
otherwise specified at runtime) this parameter is global, so even if
you run on your machine with 2 procs it might allocate the 2G for the
MPI sm module.
I would recommend, as Richard suggests, setting the parameter for your
machine in
etc/openmpi-mca-params.conf
rather than changing the default value.
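For example (just a sketch of what I mean; the value and the application
name are placeholders), a single line in that file should be enough:

    # site-specific ceiling for the sm mpool shared memory file (bytes)
    mpool_sm_max_size = 2147483647

or, for a single run, the same thing on the mpirun command line:

    mpirun -np 128 -mca mpool_sm_max_size 2147483647 ./your_app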

Markus


Rolf vandeVaart wrote:
> We are running into a problem when running on one of our larger SMPs
> using the latest Open MPI v1.2 branch.  We are trying to run a job
> with np=128 within a single node.  We are seeing the following error:
>
> "SM failed to send message due to shortage of shared memory."
>
> We then increased the allowable maximum size of the shared segment to
> 2 gigabytes minus 1, which is the maximum allowed for a 32-bit application.
> We used the mca parameter to increase it as shown here.
>
> -mca mpool_sm_max_size 2147483647
>
> This allowed the program to run to completion.  Therefore, we would
> like to increase the default maximum from 512 Mbytes to 2G-1 bytes.
> Does anyone have an objection to this change?  Soon we are going to
> have larger CPU counts and would like to increase the odds that things
> work "out of the box" on these large SMPs.
>
> On a side note, I did a quick comparison of the shared memory needs of
> the old Sun ClusterTools to Open MPI and came up with this table.
>
>                          -------- Open MPI --------
>  np   Sun ClusterTools 6    current      suggested
> ----------------------------------------------------
>   2          20M             128M           128M
>   4          20M             128M           128M
>   8          22M             256M           256M
>  16          27M             512M           512M
>  32          48M             512M             1G
>  64         133M             512M           2G-1
> 128         476M             512M           2G-1
>



Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Terry D. Dontje
Maybe a clarification of the SM BTL implementation is needed.  Does the
SM BTL not set a limit based on np, using the max allowable as a
ceiling?  If not, and all jobs are allowed to use up to the max allowable, I
see the reason for not wanting to raise the max allowable.

That being said, it seems to me that the memory usage of the SM BTL is a
lot larger than it should be.  Wasn't there some work done around June
that looked at why the SM BTL was allocating a lot of memory?  Did anything
come out of that?


--td



Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Rolf . Vandevaart


There are 3 parameters that control how much memory is used by the SM BTL.

MCA mpool: parameter "mpool_sm_max_size" (current value: "536870912")
   Maximum size of the sm mpool shared memory file
MCA mpool: parameter "mpool_sm_min_size" (current value: "134217728")
   Minimum size of the sm mpool shared memory file
MCA mpool: parameter "mpool_sm_per_peer_size" (current value: "33554432")
   Size (in bytes) to allocate per local peer in the sm mpool 
shared memory file,

   bounded by min_size and max_size
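
(For reference, the listing above is what "ompi_info --param mpool sm"
prints, assuming I remember the option syntax correctly.)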

To paraphrase the above, the default ceiling is 512M, the default floor
is 128M, and the scaling factor is 32M * procs_on_node.  Therefore, changing
the ceiling would only affect cases where there are more than 16 processes
on a node (16 * 32M = 512M).
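
As a small illustrative sketch in C (the function and variable names are
mine, not from the mpool code, and I am ignoring any rounding the real
code may do):

#include <stddef.h>

/* Illustration only: how the sm file size follows from the three MCA
 * parameters above -- the per-peer size is scaled by the number of
 * local processes and clamped to [min_size, max_size]. */
static size_t sm_file_size(int procs_on_node,
                           size_t per_peer_size, /* mpool_sm_per_peer_size */
                           size_t min_size,      /* mpool_sm_min_size */
                           size_t max_size)      /* mpool_sm_max_size */
{
    size_t size = per_peer_size * (size_t) procs_on_node;
    if (size < min_size) size = min_size;
    if (size > max_size) size = max_size;
    return size;
}

/* With the defaults, sm_file_size(16, 32<<20, 128<<20, 512<<20) already
 * returns the 512M ceiling, so only >16 local processes are affected. */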

My suggestion was to increase the ceiling from 512M to 2G-1.  And yes, we
could adjust, as suggested by Rich, by setting the parameter in
our customized openmpi-mca-params.conf file.   I just was not sure
that was the optimal solution.

Rolf




Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Li-Ta Lo

There is a serious problem with the 1.2 branch: it does not allocate
any SM area for each process at the beginning. SM areas are allocated
on demand, and if some of the processes are more aggressive than the
others, it will cause starvation. This problem is fixed in the trunk
by assigning at least one SM area to each process. I think this is what
you saw (starvation), and an increase of the max size may not be necessary.

Ollie




Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Brian Barrett

On Aug 28, 2007, at 9:05 AM, Li-Ta Lo wrote:



There is a serious problem with the 1.2 branch: it does not allocate
any SM area for each process at the beginning. SM areas are allocated
on demand, and if some of the processes are more aggressive than the
others, it will cause starvation. This problem is fixed in the trunk
by assigning at least one SM area to each process. I think this is what
you saw (starvation), and an increase of the max size may not be necessary.


Although I'm pretty sure this is fixed in the v1.2 branch already.

I don't think we should raise that ceiling at this point.  We create  
the file in /tmp, and if someone does -np 32 on a single, small node  
(not unheard of), it'll do really evil things.


Personally, I don't think we need nearly as much shared memory as  
we're using.  It's a bad design in terms of its unbounded memory  
usage.  We should fix that, rather than making the file bigger.  But  
I'm not going to fix it, so take my opinion with a grain of salt.


Brian


Re: [OMPI devel] thread model

2007-08-28 Thread Greg Watson


On Aug 27, 2007, at 10:04 PM, Jeff Squyres wrote:


On Aug 27, 2007, at 2:50 PM, Greg Watson wrote:


Until now I haven't had to worry about the opal/orte thread model.
However, there are now people who would like to use ompi that has
been configured with --with-threads=posix and --enable-mpi-threads.
Can someone give me some pointers as to what I need to do in
order to make sure I don't violate any threading model?


Note that this is *NOT* well tested.  There is work going on right
now to make the OMPI layer able to support MPI_THREAD_MULTIPLE
(support was designed in from the beginning, but we have never done
any kind of comprehensive testing/stressing of multi-thread support,
so it is pretty much guaranteed not to work), but that work is
occurring on the trunk (i.e., what will eventually become v1.3) --
not on the v1.2 branch.


The interfaces I'm calling are:

opal_event_loop()


Brian or George will have to answer about that one...


opal_path_findv()


This guy should be multi-thread safe (disclaimer: haven't tested it
myself); it doesn't rely on any global state.


orte_init()
orte_ns.create_process_name()
orte_iof.iof_subscribe()
orte_iof.iof_unsubscribe()
orte_schema.get_job_segment_name()
orte_gpr.get()
orte_dss.get()
orte_rml.send_buffer()
orte_rmgr.spawn_job()
orte_pls.terminate_job()
orte_rds.query()
orte_smr.job_stage_gate_subscribe()
orte_rmgr.get_vpid_range()


Note that all of ORTE is *NOT* thread safe, nor is it planned to be
(it just seemed way more trouble than it was worth).  You need to
serialize access to it.



Does that mean just calling OPAL_THREAD_LOCK() and OPAL_THREAD_UNLOCK()
around each?
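
i.e., something along these lines (just a sketch of the pattern I have in
mind; the mutex is one my code would own, and the wrapped call is a
placeholder):

#include "opal/threads/mutex.h"

/* one lock, owned by my code, used to serialize every ORTE call I make */
static opal_mutex_t my_orte_lock;

/* once, during startup (after orte_init()): */
OBJ_CONSTRUCT(&my_orte_lock, opal_mutex_t);

/* around each ORTE call; note that OPAL_THREAD_LOCK only really locks
 * when opal_using_threads() is true */
OPAL_THREAD_LOCK(&my_orte_lock);
/* ... one of the orte_* calls listed above ... */
OPAL_THREAD_UNLOCK(&my_orte_lock);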


Greg



Re: [OMPI devel] Maximum Shared Memory Segment - OK to increase?

2007-08-28 Thread Li-Ta Lo
On Tue, 2007-08-28 at 10:12 -0600, Brian Barrett wrote:
> Although I'm pretty sure this is fixed in the v1.2 branch already.
> 

It should never happen with the new code. The only way we can get that
message is when MCA_BTL_SM_FIFO_WRITE returns rc != OMPI_SUCCESS, but
the new MCA_BTL_SM_FIFO_WRITE always returns rc = OMPI_SUCCESS:

#define MCA_BTL_SM_FIFO_WRITE(endpoint_peer, my_smp_rank, peer_smp_rank, hdr, rc) \
do { \
    ompi_fifo_t* fifo; \
    fifo = &(mca_btl_sm_component.fifo[peer_smp_rank][my_smp_rank]); \
 \
    /* thread lock */ \
    if(opal_using_threads()) \
        opal_atomic_lock(fifo->head_lock); \
    /* post fragment */ \
    while(ompi_fifo_write_to_head(hdr, fifo, \
                                  mca_btl_sm_component.sm_mpool) != OMPI_SUCCESS) \
        opal_progress(); \
    MCA_BTL_SM_SIGNAL_PEER(endpoint_peer); \
    rc = OMPI_SUCCESS; \
    if(opal_using_threads()) \
        opal_atomic_unlock(fifo->head_lock); \
} while(0)

Rolf, are you using the very latest 1.2 branch?

Ollie




Re: [OMPI devel] [devel-core] [RFC] Runtime Services Layer

2007-08-28 Thread Ralph H Castain



On 8/27/07 7:30 AM, "Tim Prins"  wrote:

> Ralph,
> 
> Ralph H Castain wrote:
>> Just returned from vacation...sorry for delayed response
> No Problem. Hope you had a good vacation :) And sorry for my super
> delayed response. I have been pondering this a bit.
> 
>> In the past, I have expressed three concerns about the RSL.
>> 
>> 
>> My bottom line recommendation: I have no philosophical issue with the RSL
>> concept. However, I recommend holding off until the next version of ORTE is
>> completed and then re-evaluating to see how valuable the RSL might be, as
>> that next version will include memory footprint reduction and framework
>> consolidation that may yield much of the RSL's value without the extra work.
>> 
>> 
>> Long version:
>> 
>> 1. What problem are we really trying to solve?
>> If the RSL is intended to solve the Cray support problem (where the Cray OS
>> really just wants to see OMPI, not ORTE), then it may have some value. The
>> issue to date has revolved around the difficulty of maintaining the Cray
>> port in the face of changes to ORTE - as new frameworks are added, special
>> components for Cray also need to be created to provide a "do-nothing"
>> capability. In addition, the Cray is memory constrained, and the ORTE
>> library occupies considerable space while providing very little
>> functionality.
> This is definitely a motivation, but not the only one.

So...what are the others?

> 
>> The degree of value provided by the RSL will therefore depend somewhat on the
>> efficacy of the changes in development within ORTE. Those changes will,
>> among other things, significantly consolidate and reduce the number of
>> frameworks, and reduce the memory footprint. The expectation is that the
>> result will require only a single CNOS component in one framework. It isn't
>> clear, therefore, that the RSL will provide a significant value in that
>> environment.
> But won't there still be a lot of orte code linked in that will never be
> used?

Not really. The only thing left would be the stuff in runtime and util.

We have talked for years about creating an ORTE "services" framework -
basically, combining what is now in the runtime and util directories into a
single framework ala "svcs". The notion was that everything OS-specific
would go in there. What has held up implementation is (a) some thought that
maybe those things should go into OPAL instead of ORTE, and (b) low priority
and more important things to do.

However, if someone went ahead and implemented that idea, then you would
have a "NULL" component in the base that basically does a "no-op", and a
"default" component that provides actual services. Thus, for CNOS, you would
take the NULL component (so you don't open the framework's components and
avoid that memory overhead), and away you go.

I don't see how the RSL does anything better. Admittedly, you wouldn't have
to maintain the svcs APIs, but that doesn't seem any more onerous than
maintaining the RSL APIs as we change the MPI/RTE interfaces.

> 
> Also, an RSL would simplify ORTE in that there would be no need to do
> anything special for CNOS in it.

But if all I do is remove the ORTE cnos component and add an RSL cnos
component...what have I simplified?

> 
>> 
>> If the RSL is intended to aid in ORTE development, as hinted at in the RFC,
>> then I believe that is questionable. Developing ORTE in a tmp branch has
>> proven reasonably effective as changes to the MPI layer are largely
>> invisible to ORTE. Creating another layer to the system that would also have
>> to be maintained seems like a non-productive way of addressing any problems
>> in that area.
> Whether or not it would help in orte development remains to be seen. I
> just say that it might. Although I would argue that developing in tmp
> branches has caused a lot of problems with merging, etc.

Guess I don't see how this would solve the merge problems...but whatever.

> 
>> If the RSL is intended as a means of "freezing" the MPI-RTE interface, then
>> I believe we could better attain that objective by simply defining a set of
>> requirements for the RTE. As I'll note below, freezing the interface at an
>> API level could negatively impact other Open MPI objectives.
> It is intended to easily allow the development and use of other runtime
> systems, so simply defining requirements is not enough.

Could you please give some examples of these other runtimes?? Or is this
just hypothetical at this time?


> 
>> 2. Who is going to maintain old RTE versions, and why?
>> It isn't clear to me why anyone would want to do this - are we seriously
>> proposing that we maintain support for the ORTE layer that shipped with Open
>> MPI 1.0?? Can someone explain why we would want to do that?
> I highly doubt anyone would, and see no reason to include support for
> older runtime versions. Again, the purpose is to be able to run
> different runtimes. The ability to run different versions of the same
> runtime is just a side-effect.

[OMPI devel] Patch for reporter and friends

2007-08-28 Thread Jeff Squyres

Attached is a patch for the PHP side of things that does the following:

 * Creates a config.inc file for centralization of various user-settable
   parameters:
   * HTTP username/password for curl (passwords still protected; see code)
   * MTT database name/username/password
   * HTML header / footer
   * Google Analytics account number
 * Use the config.inc values in reporter, stats, and submit
 * Preliminary GA integration; if a GA account number is set in config.inc:
   * Report actual reporter URL
   * Report stats URL
   * Note that submits are not tracked by GA because the MTT client
     does not understand javascript
 * Moved "deny_mirror" functionality out of report.inc to config.inc
   because it's very www.open-mpi.org-specific


--
Jeff Squyres
Cisco Systems



mtt-php.patch
Description: Binary data




Re: [OMPI devel] Patch for reporter and friends

2007-08-28 Thread Jeff Squyres

@#$%@#$%

Sorry; I keep sending to devel instead of mtt-devel.

--
Jeff Squyres
Cisco Systems



[OMPI devel] UD BTL alltoall hangs

2007-08-28 Thread Andrew Friedley
I'm having a problem with the UD BTL and hoping someone might have some 
input to help solve it.


What I'm seeing is hangs when running alltoall benchmarks with nbcbench
or an LLNL program called mpiBench -- both hang exactly the same way.
With the code on the trunk, running nbcbench on IU's odin using 32 nodes
and a command line like this:


mpirun -np 128 -mca btl ofud,self ./nbcbench -t MPI_Alltoall -p 128-128 
-s 1-262144


hangs consistently when testing 256-byte messages.  There are two things
I can do to make the hang go away until running at larger scale.  The first
is to increase the 'btl_ofud_sd_num' MCA param from its default value of
128.  This allows you to run with more procs/nodes before hitting the 
hang, but AFAICT doesn't fix the actual problem.  What this parameter 
does is control the maximum number of outstanding send WQEs posted at 
the IB level -- when the limit is reached, frags are queued on an 
opal_list_t and later sent by progress as IB sends complete.


The other way I've found is to play games with calling 
mca_btl_ud_component_progress() in mca_btl_ud_endpoint_post_send().  In 
fact I replaced the CHECK_FRAG_QUEUES() macro used around 
btl_ofud_endpoint.c:77 with a version that loops on progress until a 
send WQE slot is available (as opposed to queueing).  Same result -- I 
can run at larger scale, but still hit the hang eventually.
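
Roughly, the replacement looks like this (a sketch from memory, not the
exact code; 'sd_wqe_available()' is a placeholder for however the BTL
tracks free send WQE slots):

/* instead of queueing the frag when no send WQE is free,
 * spin on progress until a slot opens up */
while (!sd_wqe_available()) {
    mca_btl_ud_component_progress();
}
/* ...then post the IB send as before... */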


It appears that when the job hangs, progress is being polled very 
quickly, and after spinning for a while there are no outstanding send 
WQEs or queued sends in the BTL.  I'm not sure where further up things 
are spinning/blocking, as I can't produce the hang at less than 32 nodes 
/ 128 procs and don't have a good way of debugging that (suggestions 
appreciated).


Furthermore, both ob1 and dr PMLs result in the same behavior, except 
that DR eventually trips a watchdog timeout, fails the BTL, and 
terminates the job.


Other collectives such as allreduce and allgather do not hang -- only 
alltoall.  I can also reproduce the hang on LLNL's Atlas machine.


Can anyone else reproduce this (Torsten might have to make a copy of 
nbcbench available)?  Anyone have any ideas as to what's wrong?


Andrew


Re: [OMPI devel] UD BTL alltoall hangs

2007-08-28 Thread George Bosilca
The first step will be to figure out which version of the alltoall
you're using. I suppose you use the default parameters, and then the
decision function in the tuned component says it is using the linear
alltoall. As the name states, this means that every node will
post one receive from every other node and then will start sending the
respective fragment to every other node. This will lead to a lot of
outstanding sends and receives. I doubt that the receives can cause a
problem, so I expect the problem is coming from the send side.
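
To make this concrete, each process in a linear alltoall does roughly the
following (a simplified sketch, not the actual code from the tuned
component):

#include <mpi.h>
#include <stdlib.h>

/* Simplified linear alltoall: post one receive per peer, then one send
 * per peer, then wait for everything.  It shows why the number of
 * outstanding sends grows linearly with the communicator size. */
static int linear_alltoall(void *sbuf, void *rbuf, int count,
                           MPI_Datatype dtype, MPI_Comm comm)
{
    int i, size;
    MPI_Aint lb, extent;
    MPI_Request *reqs;

    MPI_Comm_size(comm, &size);
    MPI_Type_get_extent(dtype, &lb, &extent);
    reqs = (MPI_Request *) malloc(2 * size * sizeof(MPI_Request));

    for (i = 0; i < size; i++) {          /* receives first */
        MPI_Irecv((char *) rbuf + i * count * extent, count, dtype,
                  i, 0, comm, &reqs[i]);
    }
    for (i = 0; i < size; i++) {          /* then all the sends */
        MPI_Isend((char *) sbuf + i * count * extent, count, dtype,
                  i, 0, comm, &reqs[size + i]);
    }
    MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);
    free(reqs);
    return MPI_SUCCESS;
}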


Do you have TotalView installed on odin? If yes, there is a
simple way to see how many sends are pending and where... That might
pinpoint [at least] the process where you should look to see what's
wrong.


  george.
