Could the issue have anything to do with how OMPI implements lazy connections with IBCM? Does setting the MCA parameter mpi_preconnect_all to 1 change things?
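
For example, something like this (just a sketch; the machinefile and IMB paths are taken from Doron's command below, and the exact preconnect parameter name is worth confirming with ompi_info first):

ompi_info -a | grep -i preconnect
mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib -mca mpi_preconnect_all 1 imb/src/IMB-MPI1 gather -npmin 64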

--td

On 01/16/2011 04:12 AM, Doron Shoham wrote:
Hi,

The gather hangs only with the linear_sync algorithm but works with the
basic_linear and binomial algorithms.
The gather algorithm is chosen dynamically depending on block size and
communicator size.
So, at the beginning, the binomial algorithm is chosen (communicator size
is larger than 60).
When the message size increases, the linear_sync algorithm is chosen
(with small_segment_size).
When debugging on the cluster I saw that the linear_sync function is
called in an endless loop with a segment size of 1024.
This explains why the hang occurs in the middle of the run.
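
To double-check which algorithm number maps to which gather implementation in this build, I listed the tuned gather parameters like this (the grep pattern is only my guess at the parameter names):

ompi_info -a | grep -i coll_tuned_gather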

I still don't understand why RDMACM solves it or what causes
linear_sync to hang.

Again, in 1.5 it doesn't hang (maybe the timing is different?).
I'm still trying to understand what the differences are in those areas
between 1.4.3 and 1.5.


BTW,
choosing RDMACM fixes the hangs and performance issues in all collective operations.
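
For reference, this is how I select RDMACM (the first form is what I actually ran; the second is, I assume, an equivalent way of doing it via the CPC include list):

mpirun ... -mca btl_openib_connect_rdmacm_priority 100 ...
mpirun ... -mca btl_openib_cpc_include rdmacm ...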

Thanks,
Doron


On Thu, Jan 13, 2011 at 9:44 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
RDMACM creates the same QPs with the same tunings as OOB, so I don't see how
the CPC could affect performance.

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory






On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:

+1 on what Pasha said -- if using rdmacm fixes the problem, then there's 
something else nefarious going on...

You might want to run padb on your hangs to see where all the processes are
stuck and whether anything obvious jumps out.  I'd be surprised if there's a bug
in the oob CPC; it's been around for a long, long time and should be pretty
stable.

Do we create QPs differently between oob and rdmacm, such that perhaps they are
"better" (maybe better routed, or using a different SL, or ...) when created
via rdmacm?
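
A quick sanity check might be to see whether the openib BTL in 1.4.3 exposes an explicit SL parameter and, if so, force the same value for both CPCs and re-run (just a grep; I'm not asserting which parameters exist in that branch):

ompi_info -a | grep -i service_level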


On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:

RDMACM and OOB cannot affect the performance of this benchmark, since they are
not involved in communication. So I'm not sure that the performance changes
that you see are related to connection manager changes.
About oob - I'm not aware of any hang issues there; the code is very, very old
and we have not touched it for a long time.

Regards,

Pavel (Pasha) Shamis
---
Application Performance Tools Group
Computer Science and Math Division
Oak Ridge National Laboratory
Email: sham...@ornl.gov





On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:

Hi,

For the first problem, I can see that when using rdmacm instead of oob for the
openib CPC, I get much better performance results (and no hangs!).

mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
imb/src/IMB-MPI1 gather -npmin 64


     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
          0         1000         0.04         0.05         0.05
          1         1000        19.64        19.69        19.67
          2         1000        19.97        20.02        19.99
          4         1000        21.86        21.96        21.89
          8         1000        22.87        22.94        22.90
         16         1000        24.71        24.80        24.76
         32         1000        27.23        27.32        27.27
         64         1000        30.96        31.06        31.01
        128         1000        36.96        37.08        37.02
        256         1000        42.64        42.79        42.72
        512         1000        60.32        60.59        60.46
       1024         1000        82.44        82.74        82.59
       2048         1000       497.66       499.62       498.70
       4096         1000       684.15       686.47       685.33
       8192          519       544.07       546.68       545.85
      16384          519       653.20       656.23       655.27
      32768          519       704.48       707.55       706.60
      65536          519       918.00       922.12       920.86
     131072          320      2414.08      2422.17      2418.20
     262144          160      4198.25      4227.58      4213.19
     524288           80      7333.04      7503.99      7438.18
    1048576           40     13692.60     14150.20     13948.75
    2097152           20     30377.34     32679.15     31779.86
    4194304           10     61416.70     71012.50     68380.04

How can the oob cause the hang? Isn't it only used to bring up the connection?
Does the oob play any part after the connections are made?

Thanks,
Doron

On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham <doron.o...@gmail.com> wrote:
Hi,

All machines in the setup are IDataPlex with Nehalem, 12 cores per node and
24GB memory.



* Problem 1 – OMPI 1.4.3 hangs in gather:



I'm trying to run IMB and the gather operation with OMPI 1.4.3 (vanilla).

It happens when np >= 64 and the message size exceeds 4k:

mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib imb/src-1.4.2/IMB-MPI1 gather -npmin 64



voltairenodes consists of 64 machines.



#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 64
#----------------------------------------------------------------
     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
          0         1000         0.02         0.02         0.02
          1          331        14.02        14.16        14.09
          2          331        12.87        13.08        12.93
          4          331        14.29        14.43        14.34
          8          331        16.03        16.20        16.11
         16          331        17.54        17.74        17.64
         32          331        20.49        20.62        20.53
         64          331        23.57        23.84        23.70
        128          331        28.02        28.35        28.18
        256          331        34.78        34.88        34.80
        512          331        46.34        46.91        46.60
       1024          331        63.96        64.71        64.33
       2048          331       460.67       465.74       463.18
       4096          331       637.33       643.99       640.75



This is the padb output:

padb -A -x -Ormgr=mpirun -tree:






Warning, remote process state differs across ranks
state : ranks
R (running) : [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
Stack trace(s) for thread: 1
-----------------
[0-63] (64 processes)
-----------------
main() at ?:?
IMB_init_buffers_iter() at ?:?
  IMB_gather() at ?:?
    PMPI_Gather() at pgather.c:175
      mca_coll_sync_gather() at coll_sync_gather.c:46
        ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
          -----------------
          [0,3-63] (62 processes)
          -----------------
          ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
            mca_pml_ob1_recv() at pml_ob1_irecv.c:104
              ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
                opal_condition_wait() at ../../../../opal/threads/condition.h:99
          -----------------
          [1] (1 processes)
          -----------------
          ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
            mca_pml_ob1_send() at pml_ob1_isend.c:125
              ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
                opal_condition_wait() at ../../../../opal/threads/condition.h:99
          -----------------
          [2] (1 processes)
          -----------------
          ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
            ompi_request_default_wait() at request/req_wait.c:37
              ompi_request_wait_completion() at ../ompi/request/request.h:375
                opal_condition_wait() at ../opal/threads/condition.h:99
Stack trace(s) for thread: 2
-----------------
[0-63] (64 processes)
-----------------
start_thread() at ?:?
btl_openib_async_thread() at btl_openib_async.c:344
  poll() at ?:?
Stack trace(s) for thread: 3
-----------------
[0-63] (64 processes)
-----------------
start_thread() at ?:?
service_thread_start() at btl_openib_fd.c:427
  select() at ?:?






When running padb again after a couple of minutes, I can see that the total
number of processes at each position remains the same, but
different ranks are at those positions.

For example, this is the diff between two padb outputs:



Warning, remote process state differs across ranks
state : ranks
-R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
-S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
+R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
+S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
Stack trace(s) for thread: 1
-----------------
[0-63] (64 processes)
@@ -13,21 +13,21 @@
mca_coll_sync_gather() at coll_sync_gather.c:46
ompi_coll_tuned_gather_intra_dec_fixed() at coll_tuned_decision_fixed.c:714
-----------------
- [0,3-63] (62 processes)
+ [0-5,8-63] (62 processes)
-----------------
ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
mca_pml_ob1_recv() at pml_ob1_irecv.c:104
ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
opal_condition_wait() at ../../../../opal/threads/condition.h:99
-----------------
- [1] (1 processes)
+ [6] (1 processes)
-----------------
ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
mca_pml_ob1_send() at pml_ob1_isend.c:125
ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
opal_condition_wait() at ../../../../opal/threads/condition.h:99
-----------------
- [2] (1 processes)
+ [7] (1 processes)
-----------------
ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
ompi_request_default_wait() at request/req_wait.c:37





Choosing a different gather algorithm seems to bypass the hang.

I've used the following MCA parameters:

--mca coll_tuned_use_dynamic_rules 1

--mca coll_tuned_gather_algorithm 1



Actually, both dec_fixed and basic_linear work, while binomial and linear_sync
don't.
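
So the full working command was essentially (same binary and machinefile as above):

mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca coll_tuned_gather_algorithm 1 imb/src-1.4.2/IMB-MPI1 gather -npmin 64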



With OMPI 1.5 it doesn't hang (with any of the gather algorithms) and it is much
faster (the number of repetitions is much higher):

#----------------------------------------------------------------
# Benchmarking Gather
# #processes = 64
#----------------------------------------------------------------
     #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
          0         1000         0.02         0.03         0.02
          1         1000        18.50        18.55        18.53
          2         1000        18.17        18.25        18.22
          4         1000        19.04        19.10        19.07
          8         1000        19.60        19.67        19.64
         16         1000        21.39        21.47        21.43
         32         1000        24.83        24.91        24.87
         64         1000        27.35        27.45        27.40
        128         1000        33.23        33.34        33.29
        256         1000        41.24        41.39        41.32
        512         1000        52.62        52.81        52.71
       1024         1000        73.20        73.46        73.32
       2048         1000       416.36       418.04       417.22
       4096         1000       638.54       640.70       639.65
       8192         1000       506.26       506.97       506.63
      16384         1000       600.63       601.40       601.02
      32768         1000       639.52       640.34       639.93
      65536          640       914.22       916.02       915.13
     131072          320      2287.37      2295.18      2291.35
     262144          160      4041.36      4070.58      4056.27
     524288           80      7292.35      7463.27      7397.14
    1048576           40     13647.15     14107.15     13905.29
    2097152           20     30625.00     32635.45     31815.36
    4194304           10     63543.01     70987.49     68680.48





* Problem 2 – segmentation fault with OMPI 1.4.3/1.5 and IMB gather, np=768:

When trying to run the same command but with np=768 I get a segmentation fault:

openmpi-1.4.3/bin/mpirun -np 768 -machinefile voltairenodes -mca btl 
sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca 
coll_tuned_gather_algorithm 1 imb/src/IMB-MPI1 gather -npmin 768 -mem 1.6



This happens with both OMPI 1.4.3 and 1.5.



[compa163:20249] *** Process received signal ***
[compa163:20249] Signal: Segmentation fault (11)
[compa163:20249] Signal code: Address not mapped (1)
[compa163:20249] Failing at address: 0x2aab4a204000
[compa163:20249] [ 0] /lib64/libpthread.so.0 [0x366aa0e7c0]
[compa163:20249] [ 1] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(ompi_convertor_unpack+0x15f) [0x2b077882282e]
[compa163:20249] [ 2] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9e1672]
[compa163:20249] [ 3] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9dd0b6]
[compa163:20249] [ 4] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so [0x2b077c459d87]
[compa163:20249] [ 5] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0xbe) [0x2b0778d845b8]
[compa163:20249] [ 6] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6d62]
[compa163:20249] [ 7] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6ba7]
[compa163:20249] [ 8] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so [0x2b077b9d6a90]
[compa163:20249] [ 9] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d298dc5]
[compa163:20249] [10] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d2990d3]
[compa163:20249] [11] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so [0x2b077d286e9b]
[compa163:20249] [12] /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_sync.so [0x2b077d07e96c]
[compa163:20249] [13] /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(PMPI_Gather+0x55e) [0x2b077883ec9a]
[compa163:20249] [14] imb/src/IMB-MPI1(IMB_gather+0xe8) [0x40a088]
[compa163:20249] [15] imb/src/IMB-MPI1(IMB_init_buffers_iter+0x28a) [0x405baa]
[compa163:20249] [16] imb/src/IMB-MPI1(main+0x30f) [0x40362f]
[compa163:20249] [17] /lib64/libc.so.6(__libc_start_main+0xf4) [0x3669e1d994]
[compa163:20249] [18] imb/src/IMB-MPI1 [0x403269]
[compa163:20249] *** End of error message ***
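
If it helps, I can try to get a symbolic backtrace from a core file along these lines (assuming core dumps are enabled on the node and the build has enough debug info):

ulimit -c unlimited
# re-run the failing np=768 command, then on the node that crashed:
gdb imb/src/IMB-MPI1 core.<pid>
(gdb) bt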


Any ideas? More debugging tips?

Thanks,
Doron

--
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/





--
Oracle
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


