Re: [OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799
On Wed, Dec 05, 2007 at 02:45:17PM -0500, Tim Mattox wrote:
> Hello,
> It appears that sometime after r16777, and by r16799, something
> was broken in the trunk's openib support for 32-bit builds.
> The 64-bit tests all seem normal, as do the 32-bit & 64-bit tests on
> the 1.2 branch on the same machine (odin).
>
> See this MTT results page permalink showing the 32-bit odin runs:
> http://www.open-mpi.org/mtt/index.php?do_redir=468
>
> Pasha & Gleb, you both did a variety of checkins in that svn r# range.
> Does either of you have time to investigate this?
>
> Here is a snippet from one randomly picked failed test (out of thousands):
> [1,1][btl_openib_component.c:1665:btl_openib_module_progress] from
> odin001 to: odin001 error
> polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for
> wr_id 141733120 opcode 128
> qp_idx 3
> --
> mpirun has exited due to process rank 1 with PID 29761 on
> node odin001 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --
>
> Thanks, and happy bug hunting!

I know where the problem is. Will fix this week.

--
Gleb.
[OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799
Hello,

It appears that sometime after r16777, and by r16799, something was broken in the trunk's openib support for 32-bit builds. The 64-bit tests all seem normal, as do the 32-bit & 64-bit tests on the 1.2 branch on the same machine (odin).

See this MTT results page permalink showing the 32-bit odin runs:
http://www.open-mpi.org/mtt/index.php?do_redir=468

Pasha & Gleb, you both did a variety of checkins in that svn r# range. Does either of you have time to investigate this?

Here is a snippet from one randomly picked failed test (out of thousands):

[1,1][btl_openib_component.c:1665:btl_openib_module_progress] from odin001 to: odin001 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 141733120 opcode 128 qp_idx 3
--
mpirun has exited due to process rank 1 with PID 29761 on
node odin001 calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--

Thanks, and happy bug hunting!

--
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/