+1.  I'm afraid I don't know offhand why there would be such differences.  I'm 
thinking that you'll need to dive a little deeper to figure it out; sorry.  :-(


On Jan 16, 2011, at 10:54 AM, Shamis, Pavel wrote:

> Well, then I would suspect the rdmacm vs. oob QP configuration. They are 
> supposed to be the same, but there is probably a bug somewhere and the rdmacm 
> QP tuning somehow differs from oob; that is a potential cause of the 
> performance differences that you see.
> 
> Regards,
> Pavel (Pasha) Shamis
> ---
> Application Performance Tools Group
> Computer Science and Math Division
> Oak Ridge National Laboratory
> 
> 
> 
> 
> 
> 
> On Jan 16, 2011, at 4:12 AM, Doron Shoham wrote:
> 
>> Hi,
>> 
>> The gather hangs only in the linear_sync algorithm but works with the
>> basic_linear and binomial algorithms.
>> The gather algorithm is chosen dynamically depending on the block size and
>> the communicator size.
>> So, in the beginning, the binomial algorithm is chosen (the communicator size
>> is larger than 60).
>> When the message size increases, the linear_sync algorithm is chosen
>> (with small_segment_size).
>> When debugging on the cluster I saw that the linear_sync function is
>> called in an endless loop with a segment size of 1024.
>> This explains why the hang occurs in the middle of the run.
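>> 
>> For illustration, here is a rough sketch of that kind of fixed decision
>> logic (the threshold value and names below are made up for the example;
>> they are not the actual constants from coll_tuned_decision_fixed.c):
>> 
>> #include <stddef.h>
>> 
>> /* Illustrative only: pick a gather algorithm from the communicator size
>>  * and the per-rank block size, roughly as described above. */
>> enum gather_alg { GATHER_BASIC_LINEAR, GATHER_BINOMIAL, GATHER_LINEAR_SYNC };
>> 
>> static enum gather_alg pick_gather_alg(int comm_size, size_t block_size)
>> {
>>     const size_t sync_block_threshold = 6000; /* made-up cutoff */
>>     const int    large_comm           = 60;   /* "larger than 60" above */
>> 
>>     if (block_size > sync_block_threshold)
>>         return GATHER_LINEAR_SYNC;   /* segmented; the case that hangs here */
>>     if (comm_size > large_comm)
>>         return GATHER_BINOMIAL;      /* what the run starts with */
>>     return GATHER_BASIC_LINEAR;
>> }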
>> 
>> I still don't understand why RDMACM solves it or what causes the
>> linear_sync hangs.
>> 
>> Again, in 1.5 it doesn't hang (maybe the timing is different?).
>> I'm still trying to understand what the differences in those areas are
>> between 1.4.3 and 1.5.
>> 
>> 
>> BTW,
>> choosing RDMACM fixes the hangs and performance issues in all collective 
>> operations.
>> 
>> Thanks,
>> Doron
>> 
>> 
>> On Thu, Jan 13, 2011 at 9:44 PM, Shamis, Pavel <sham...@ornl.gov> wrote:
>>> RDMACM creates the same QPs with the same tunings as OOB, so I don't see 
>>> how the CPC could affect performance.
>>> 
>>> Pavel (Pasha) Shamis
>>> ---
>>> Application Performance Tools Group
>>> Computer Science and Math Division
>>> Oak Ridge National Laboratory
>>> 
>>> 
>>> 
>>> 
>>> 
>>> 
>>> On Jan 13, 2011, at 2:15 PM, Jeff Squyres wrote:
>>> 
>>>> +1 on what Pasha said -- if using rdmacm fixes the problem, then there's 
>>>> something else nefarious going on...
>>>> 
>>>> You might want to run padb on your hangs to see where all the processes 
>>>> are stuck and whether anything obvious jumps out.  I'd be surprised if 
>>>> there's a bug in the oob CPC; it's been around for a long, long time and 
>>>> should be pretty stable.
>>>> 
>>>> Do we create QPs differently between oob and rdmacm, such that perhaps 
>>>> they are "better" (maybe better routed, or using a different SL, or ...) 
>>>> when created via rdmacm?
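>>>> 
>>>> If it helps narrow that down, here is a quick diagnostic sketch (plain
>>>> libibverbs, not openib BTL code; the function name and the choice of
>>>> attributes are just my guess at what is worth comparing) that dumps the
>>>> tuning of an already-connected QP, so oob- and rdmacm-created QPs can
>>>> be compared side by side:
>>>> 
>>>> #include <stdio.h>
>>>> #include <infiniband/verbs.h>
>>>> 
>>>> /* Print the path/tuning attributes of a connected QP (SL, MTU, timeouts,
>>>>  * retry counts).  "qp" is whatever struct ibv_qp* you pull out of the BTL. */
>>>> static void dump_qp_tuning(struct ibv_qp *qp)
>>>> {
>>>>     struct ibv_qp_attr attr;
>>>>     struct ibv_qp_init_attr init;
>>>>     int mask = IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_TIMEOUT |
>>>>                IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_MIN_RNR_TIMER;
>>>> 
>>>>     if (0 == ibv_query_qp(qp, &attr, mask, &init)) {
>>>>         printf("qp 0x%x: sl=%u mtu=%d timeout=%u retry=%u rnr_retry=%u min_rnr=%u\n",
>>>>                (unsigned) qp->qp_num, (unsigned) attr.ah_attr.sl,
>>>>                (int) attr.path_mtu, (unsigned) attr.timeout,
>>>>                (unsigned) attr.retry_cnt, (unsigned) attr.rnr_retry,
>>>>                (unsigned) attr.min_rnr_timer);
>>>>     }
>>>> }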
>>>> 
>>>> 
>>>> On Jan 12, 2011, at 12:12 PM, Shamis, Pavel wrote:
>>>> 
>>>>> RDMACM or OOB cannot affect the performance of this benchmark, since they 
>>>>> are not involved in communication.  So I'm not sure that the performance 
>>>>> changes that you see are related to connection-manager changes.
>>>>> About oob - I'm not aware of any hang issues there; the code is very, very 
>>>>> old and we have not touched it for a long time.
>>>>> 
>>>>> Regards,
>>>>> 
>>>>> Pavel (Pasha) Shamis
>>>>> ---
>>>>> Application Performance Tools Group
>>>>> Computer Science and Math Division
>>>>> Oak Ridge National Laboratory
>>>>> Email: sham...@ornl.gov
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> On Jan 12, 2011, at 8:45 AM, Doron Shoham wrote:
>>>>> 
>>>>>> Hi,
>>>>>> 
>>>>>> For the first problem, I can see that when using rdmacm instead of oob
>>>>>> as the openib connection manager, I get much better performance results
>>>>>> (and no hangs!).
>>>>>> 
>>>>>> mpirun -display-map -np 64 -machinefile voltairenodes -mca btl
>>>>>> sm,self,openib -mca btl_openib_connect_rdmacm_priority 100
>>>>>> imb/src/IMB-MPI1 gather -npmin 64
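>>>>>> 
>>>>>> (To double-check which CPC actually got selected for each connection --
>>>>>> assuming the openib BTL still logs its CPC selection at high verbosity,
>>>>>> which I believe it does -- I can rerun the same command with
>>>>>> "-mca btl_base_verbose 100" added and grep the output for the
>>>>>> connect/CPC messages.)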
>>>>>> 
>>>>>> 
>>>>>> #bytes      #repetitions        t_min[usec]     t_max[usec]     t_avg[usec]
>>>>>> 
>>>>>> 0       1000        0.04        0.05        0.05
>>>>>> 
>>>>>> 1       1000        19.64       19.69       19.67
>>>>>> 
>>>>>> 2       1000        19.97       20.02       19.99
>>>>>> 
>>>>>> 4       1000        21.86       21.96       21.89
>>>>>> 
>>>>>> 8       1000        22.87       22.94       22.90
>>>>>> 
>>>>>> 16      1000        24.71       24.80       24.76
>>>>>> 
>>>>>> 32      1000        27.23       27.32       27.27
>>>>>> 
>>>>>> 64      1000        30.96       31.06       31.01
>>>>>> 
>>>>>> 128     1000        36.96       37.08       37.02
>>>>>> 
>>>>>> 256     1000        42.64       42.79       42.72
>>>>>> 
>>>>>> 512     1000        60.32       60.59       60.46
>>>>>> 
>>>>>> 1024    1000        82.44       82.74       82.59
>>>>>> 
>>>>>> 2048    1000        497.66      499.62      498.70
>>>>>> 
>>>>>> 4096    1000        684.15      686.47      685.33
>>>>>> 
>>>>>> 8192    519         544.07      546.68      545.85
>>>>>> 
>>>>>> 16384   519         653.20      656.23      655.27
>>>>>> 
>>>>>> 32768   519         704.48      707.55      706.60
>>>>>> 
>>>>>> 65536   519         918.00      922.12      920.86
>>>>>> 
>>>>>> 131072  320         2414.08     2422.17     2418.20
>>>>>> 
>>>>>> 262144  160         4198.25     4227.58     4213.19
>>>>>> 
>>>>>> 524288  80          7333.04     7503.99     7438.18
>>>>>> 
>>>>>> 1048576 40          13692.60    14150.20    13948.75
>>>>>> 
>>>>>> 2097152 20          30377.34    32679.15    31779.86
>>>>>> 
>>>>>> 4194304 10          61416.70    71012.50    68380.04
>>>>>> 
>>>>>> How can the oob cause the hang?  Isn't it only used to bring up the 
>>>>>> connection?
>>>>>> Does the oob play any part after the connections are made?
>>>>>> 
>>>>>> Thanks,
>>>>>> Doron
>>>>>> 
>>>>>> On Tue, Jan 11, 2011 at 2:58 PM, Doron Shoham <doron.o...@gmail.com> 
>>>>>> wrote:
>>>>>>> 
>>>>>>> Hi
>>>>>>> 
>>>>>>> All machines in the setup are IDataPlex nodes with Nehalem CPUs, 
>>>>>>> 12 cores per node and 24GB of memory.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> * Problem 1 – OMPI 1.4.3 hangs in gather:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> I’m trying to run the IMB gather operation with OMPI 1.4.3 (vanilla).
>>>>>>> 
>>>>>>> It happens when np >= 64 and the message size exceeds 4k:
>>>>>>> 
>>>>>>> mpirun -np 64 -machinefile voltairenodes -mca btl sm,self,openib  
>>>>>>> imb/src-1.4.2/IMB-MPI1 gather -npmin 64
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> voltairenodes consists of 64 machines.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> #----------------------------------------------------------------
>>>>>>> 
>>>>>>> # Benchmarking Gather
>>>>>>> 
>>>>>>> # #processes = 64
>>>>>>> 
>>>>>>> #----------------------------------------------------------------
>>>>>>> 
>>>>>>>  #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>>>> 
>>>>>>>       0         1000         0.02         0.02         0.02
>>>>>>> 
>>>>>>>       1          331        14.02        14.16        14.09
>>>>>>> 
>>>>>>>       2          331        12.87        13.08        12.93
>>>>>>> 
>>>>>>>       4          331        14.29        14.43        14.34
>>>>>>> 
>>>>>>>       8          331        16.03        16.20        16.11
>>>>>>> 
>>>>>>>      16          331        17.54        17.74        17.64
>>>>>>> 
>>>>>>>      32          331        20.49        20.62        20.53
>>>>>>> 
>>>>>>>      64          331        23.57        23.84        23.70
>>>>>>> 
>>>>>>>     128          331        28.02        28.35        28.18
>>>>>>> 
>>>>>>>     256          331        34.78        34.88        34.80
>>>>>>> 
>>>>>>>     512          331        46.34        46.91        46.60
>>>>>>> 
>>>>>>>    1024          331        63.96        64.71        64.33
>>>>>>> 
>>>>>>>    2048          331       460.67       465.74       463.18
>>>>>>> 
>>>>>>>    4096          331       637.33       643.99       640.75
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> This is the padb output:
>>>>>>> 
>>>>>>> padb -A -x -Ormgr=mpirun -tree:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Warning, remote process state differs across ranks
>>>>>>> 
>>>>>>> state : ranks
>>>>>>> 
>>>>>>> R (running) : 
>>>>>>> [1,3-6,8,10-13,16-20,23-28,30-32,34-42,44-45,47-49,51-53,56-59,61-63]
>>>>>>> 
>>>>>>> S (sleeping) : [0,2,7,9,14-15,21-22,29,33,43,46,50,54-55,60]
>>>>>>> 
>>>>>>> Stack trace(s) for thread: 1
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> [0-63] (64 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> main() at ?:?
>>>>>>> 
>>>>>>> IMB_init_buffers_iter() at ?:?
>>>>>>> 
>>>>>>> IMB_gather() at ?:?
>>>>>>> 
>>>>>>> PMPI_Gather() at pgather.c:175
>>>>>>> 
>>>>>>>   mca_coll_sync_gather() at coll_sync_gather.c:46
>>>>>>> 
>>>>>>>     ompi_coll_tuned_gather_intra_dec_fixed() at 
>>>>>>> coll_tuned_decision_fixed.c:714
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       [0,3-63] (62 processes)
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       ompi_coll_tuned_gather_intra_linear_sync() at 
>>>>>>> coll_tuned_gather.c:248
>>>>>>> 
>>>>>>>         mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>>>>> 
>>>>>>>           ompi_request_wait_completion() at 
>>>>>>> ../../../../ompi/request/request.h:375
>>>>>>> 
>>>>>>>             opal_condition_wait() at 
>>>>>>> ../../../../opal/threads/condition.h:99
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       [1] (1 processes)
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       ompi_coll_tuned_gather_intra_linear_sync() at 
>>>>>>> coll_tuned_gather.c:302
>>>>>>> 
>>>>>>>         mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>>>>> 
>>>>>>>           ompi_request_wait_completion() at 
>>>>>>> ../../../../ompi/request/request.h:375
>>>>>>> 
>>>>>>>             opal_condition_wait() at 
>>>>>>> ../../../../opal/threads/condition.h:99
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       [2] (1 processes)
>>>>>>> 
>>>>>>>       -----------------
>>>>>>> 
>>>>>>>       ompi_coll_tuned_gather_intra_linear_sync() at 
>>>>>>> coll_tuned_gather.c:315
>>>>>>> 
>>>>>>>         ompi_request_default_wait() at request/req_wait.c:37
>>>>>>> 
>>>>>>>           ompi_request_wait_completion() at 
>>>>>>> ../ompi/request/request.h:375
>>>>>>> 
>>>>>>>             opal_condition_wait() at ../opal/threads/condition.h:99
>>>>>>> 
>>>>>>> Stack trace(s) for thread: 2
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> [0-63] (64 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> start_thread() at ?:?
>>>>>>> 
>>>>>>> btl_openib_async_thread() at btl_openib_async.c:344
>>>>>>> 
>>>>>>> poll() at ?:?
>>>>>>> 
>>>>>>> Stack trace(s) for thread: 3
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> [0-63] (64 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> start_thread() at ?:?
>>>>>>> 
>>>>>>> service_thread_start() at btl_openib_fd.c:427
>>>>>>> 
>>>>>>> select() at ?:?
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> When running padb again after a couple of minutes, I can see that the 
>>>>>>> same total number of processes remain in the same positions, but 
>>>>>>> different processes are at those positions.
>>>>>>> 
>>>>>>> For example, this is the diff between two padb outputs:
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Warning, remote process state differs across ranks
>>>>>>> 
>>>>>>> state : ranks
>>>>>>> 
>>>>>>> -R (running) : [0,2-4,6-13,16-18,20-21,28-31,33-36,38-56,58,60,62-63]
>>>>>>> 
>>>>>>> -S (sleeping) : [1,5,14-15,19,22-27,32,37,57,59,61]
>>>>>>> 
>>>>>>> +R (running) : [2,5-14,16-23,25,28-40,42-48,50-51,53-58,61,63]
>>>>>>> 
>>>>>>> +S (sleeping) : [0-1,3-4,15,24,26-27,41,49,52,59-60,62]
>>>>>>> 
>>>>>>> Stack trace(s) for thread: 1
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> [0-63] (64 processes)
>>>>>>> 
>>>>>>> @@ -13,21 +13,21 @@
>>>>>>> 
>>>>>>> mca_coll_sync_gather() at coll_sync_gather.c:46
>>>>>>> 
>>>>>>> ompi_coll_tuned_gather_intra_dec_fixed() at 
>>>>>>> coll_tuned_decision_fixed.c:714
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> - [0,3-63] (62 processes)
>>>>>>> 
>>>>>>> + [0-5,8-63] (62 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:248
>>>>>>> 
>>>>>>> mca_pml_ob1_recv() at pml_ob1_irecv.c:104
>>>>>>> 
>>>>>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>>>> 
>>>>>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> - [1] (1 processes)
>>>>>>> 
>>>>>>> + [6] (1 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:302
>>>>>>> 
>>>>>>> mca_pml_ob1_send() at pml_ob1_isend.c:125
>>>>>>> 
>>>>>>> ompi_request_wait_completion() at ../../../../ompi/request/request.h:375
>>>>>>> 
>>>>>>> opal_condition_wait() at ../../../../opal/threads/condition.h:99
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> - [2] (1 processes)
>>>>>>> 
>>>>>>> + [7] (1 processes)
>>>>>>> 
>>>>>>> -----------------
>>>>>>> 
>>>>>>> ompi_coll_tuned_gather_intra_linear_sync() at coll_tuned_gather.c:315
>>>>>>> 
>>>>>>> ompi_request_default_wait() at request/req_wait.c:37
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Choosing a different gather algorithm seems to bypass the hang.
>>>>>>> 
>>>>>>> I've used the following MCA parameters:
>>>>>>> 
>>>>>>> --mca coll_tuned_use_dynamic_rules 1
>>>>>>> 
>>>>>>> --mca coll_tuned_gather_algorithm 1
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> Actually, both dec_fixed and basic_linear work, while binomial and 
>>>>>>> linear_sync don't.
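>>>>>>> 
>>>>>>> (To map the coll_tuned_gather_algorithm numbers to algorithm names I
>>>>>>> used ompi_info -- assuming its help strings still list the valid
>>>>>>> choices -- e.g.:
>>>>>>> 
>>>>>>> ompi_info --param coll tuned | grep gather_algorithm
>>>>>>> 
>>>>>>> which prints the parameter together with the numbered list of
>>>>>>> algorithms it can be locked to.)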
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> With OMPI 1.5 it doesn't hang (with any of the gather algorithms) and it 
>>>>>>> is much faster (the number of repetitions is much higher):
>>>>>>> 
>>>>>>> #----------------------------------------------------------------
>>>>>>> 
>>>>>>> # Benchmarking Gather
>>>>>>> 
>>>>>>> # #processes = 64
>>>>>>> 
>>>>>>> #----------------------------------------------------------------
>>>>>>> 
>>>>>>>  #bytes #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
>>>>>>> 
>>>>>>>       0         1000         0.02         0.03         0.02
>>>>>>> 
>>>>>>>       1         1000        18.50        18.55        18.53
>>>>>>> 
>>>>>>>       2         1000        18.17        18.25        18.22
>>>>>>> 
>>>>>>>       4         1000        19.04        19.10        19.07
>>>>>>> 
>>>>>>>       8         1000        19.60        19.67        19.64
>>>>>>> 
>>>>>>>      16         1000        21.39        21.47        21.43
>>>>>>> 
>>>>>>>      32         1000        24.83        24.91        24.87
>>>>>>> 
>>>>>>>      64         1000        27.35        27.45        27.40
>>>>>>> 
>>>>>>>     128         1000        33.23        33.34        33.29
>>>>>>> 
>>>>>>>     256         1000        41.24        41.39        41.32
>>>>>>> 
>>>>>>>     512         1000        52.62        52.81        52.71
>>>>>>> 
>>>>>>>    1024         1000        73.20        73.46        73.32
>>>>>>> 
>>>>>>>    2048         1000       416.36       418.04       417.22
>>>>>>> 
>>>>>>>    4096         1000       638.54       640.70       639.65
>>>>>>> 
>>>>>>>    8192         1000       506.26       506.97       506.63
>>>>>>> 
>>>>>>>   16384         1000       600.63       601.40       601.02
>>>>>>> 
>>>>>>>   32768         1000       639.52       640.34       639.93
>>>>>>> 
>>>>>>>   65536          640       914.22       916.02       915.13
>>>>>>> 
>>>>>>>  131072          320      2287.37      2295.18      2291.35
>>>>>>> 
>>>>>>>  262144          160      4041.36      4070.58      4056.27
>>>>>>> 
>>>>>>>  524288           80      7292.35      7463.27      7397.14
>>>>>>> 
>>>>>>> 1048576           40     13647.15     14107.15     13905.29
>>>>>>> 
>>>>>>> 2097152           20     30625.00     32635.45     31815.36
>>>>>>> 
>>>>>>> 4194304           10     63543.01     70987.49     68680.48
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> * Problem 2 – segmentation fault with OMPI 1.4.3/1.5 and IMB gather 
>>>>>>> np=768:
>>>>>>> 
>>>>>>> When trying to run the same command but with np=768, I get a 
>>>>>>> segmentation fault:
>>>>>>> 
>>>>>>> openmpi-1.4.3/bin/mpirun -np 768 -machinefile voltairenodes -mca btl 
>>>>>>> sm,self,openib -mca coll_tuned_use_dynamic_rules 1 -mca 
>>>>>>> coll_tuned_gather_algorithm 1 imb/src/IMB-MPI1 gather -npmin 768 -mem 
>>>>>>> 1.6
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> This happens with both OMPI 1.4.3 and 1.5.
>>>>>>> 
>>>>>>> 
>>>>>>> 
>>>>>>> [compa163:20249] *** Process received signal ***
>>>>>>> 
>>>>>>> [compa163:20249] Signal: Segmentation fault (11)
>>>>>>> 
>>>>>>> [compa163:20249] Signal code: Address not mapped (1)
>>>>>>> 
>>>>>>> [compa163:20249] Failing at address: 0x2aab4a204000
>>>>>>> 
>>>>>>> [compa163:20249] [ 0] /lib64/libpthread.so.0 [0x366aa0e7c0]
>>>>>>> 
>>>>>>> [compa163:20249] [ 1] 
>>>>>>> /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(ompi_convertor_unpack+0x15f)
>>>>>>>  [0x2b077882282e]
>>>>>>> 
>>>>>>> [compa163:20249] [ 2] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
>>>>>>>  [0x2b077b9e1672]
>>>>>>> 
>>>>>>> [compa163:20249] [ 3] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
>>>>>>>  [0x2b077b9dd0b6]
>>>>>>> 
>>>>>>> [compa163:20249] [ 4] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_btl_sm.so
>>>>>>>  [0x2b077c459d87]
>>>>>>> 
>>>>>>> [compa163:20249] [ 5] 
>>>>>>> /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libopen-pal.so.0(opal_progress+0xbe)
>>>>>>>  [0x2b0778d845b8]
>>>>>>> 
>>>>>>> [compa163:20249] [ 6] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
>>>>>>>  [0x2b077b9d6d62]
>>>>>>> 
>>>>>>> [compa163:20249] [ 7] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
>>>>>>>  [0x2b077b9d6ba7]
>>>>>>> 
>>>>>>> [compa163:20249] [ 8] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_pml_ob1.so
>>>>>>>  [0x2b077b9d6a90]
>>>>>>> 
>>>>>>> [compa163:20249] [ 9] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so
>>>>>>>  [0x2b077d298dc5]
>>>>>>> 
>>>>>>> [compa163:20249] [10] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so
>>>>>>>  [0x2b077d2990d3]
>>>>>>> 
>>>>>>> [compa163:20249] [11] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_tuned.so
>>>>>>>  [0x2b077d286e9b]
>>>>>>> 
>>>>>>> [compa163:20249] [12] 
>>>>>>> /gpfs/asrc/home/voltaire/install/openmpi-1.4.3/lib/openmpi/mca_coll_sync.so
>>>>>>>  [0x2b077d07e96c]
>>>>>>> 
>>>>>>> [compa163:20249] [13] 
>>>>>>> /gpfs/asrc/home/voltaire/install//openmpi-1.4.3/lib/libmpi.so.0(PMPI_Gather+0x55e)
>>>>>>>  [0x2b077883ec9a]
>>>>>>> 
>>>>>>> [compa163:20249] [14] imb/src/IMB-MPI1(IMB_gather+0xe8) [0x40a088]
>>>>>>> 
>>>>>>> [compa163:20249] [15] imb/src/IMB-MPI1(IMB_init_buffers_iter+0x28a) 
>>>>>>> [0x405baa]
>>>>>>> 
>>>>>>> [compa163:20249] [16] imb/src/IMB-MPI1(main+0x30f) [0x40362f]
>>>>>>> 
>>>>>>> [compa163:20249] [17] /lib64/libc.so.6(__libc_start_main+0xf4) 
>>>>>>> [0x3669e1d994]
>>>>>>> 
>>>>>>> [compa163:20249] [18] imb/src/IMB-MPI1 [0x403269]
>>>>>>> 
>>>>>>> [compa163:20249] *** End of error message ***
>>>>>>> 
>>>>>>> 
>>>>>>> Any ideas? More debugging tips?
>>>>>>> 
>>>>>>> Thanks,
>>>>>>> Doron
>>>>>> 
>>>>> 
>>>>> 
>>>> 
>>>> 
>>>> --
>>>> Jeff Squyres
>>>> jsquy...@cisco.com
>>>> For corporate legal information go to:
>>>> http://www.cisco.com/web/about/doing_business/legal/cri/
>>>> 
>>>> 
>>> 
>>> 
>>> 
>> 
> 
> 


-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/

