Re: [OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799

2007-12-09 Thread Gleb Natapov
On Wed, Dec 05, 2007 at 02:45:17PM -0500, Tim Mattox wrote:
> Hello,
> It appears that sometime after r16777, and by r16799, something
> broke in the trunk's openib support for 32-bit builds.
> The 64-bit tests all seem normal, as do the 32-bit and 64-bit tests on
> the 1.2 branch on the same machine (odin).
> 
> See this MTT results page permalink showing the 32-bit odin runs:
> http://www.open-mpi.org/mtt/index.php?do_redir=468
> 
> Pasha & Gleb, you both made a variety of check-ins in that svn revision range.
> Does either of you have time to investigate this?
> 
> Here is a snippet from one randomly picked failed test (out of thousands):
> [1,1][btl_openib_component.c:1665:btl_openib_module_progress] from odin001 to: odin001 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 141733120 opcode 128 qp_idx 3
> --
> mpirun has exited due to process rank 1 with PID 29761 on
> node odin001 calling "abort". This will have caused other processes
> in the application to be terminated by signals sent by mpirun
> (as reported here).
> --
> 
> Thanks, and happy bug hunting!
I know where the problem is. Will fix this week.
--
Gleb.


[OMPI devel] 32-bit openib is broken on the trunk as of Nov 27th, r16799

2007-12-05 Thread Tim Mattox
Hello,
It appears that sometime after r16777, and by r16799, something
broke in the trunk's openib support for 32-bit builds.
The 64-bit tests all seem normal, as do the 32-bit and 64-bit tests on
the 1.2 branch on the same machine (odin).

See this MTT results page permalink showing the 32-bit odin runs:
http://www.open-mpi.org/mtt/index.php?do_redir=468

Pasha & Gleb, you both made a variety of check-ins in that svn revision range.
Does either of you have time to investigate this?

Here is a snippet from one randomly picked failed test (out of thousands):
[1,1][btl_openib_component.c:1665:btl_openib_module_progress] from odin001 to: odin001 error polling LP CQ with status LOCAL PROTOCOL ERROR status number 4 for wr_id 141733120 opcode 128 qp_idx 3
--
mpirun has exited due to process rank 1 with PID 29761 on
node odin001 calling "abort". This will have caused other processes
in the application to be terminated by signals sent by mpirun
(as reported here).
--

Thanks, and happy bug hunting!
-- 
Tim Mattox, Ph.D. - http://homepage.mac.com/tmattox/
 tmat...@gmail.com || timat...@open-mpi.org
I'm a bright... http://www.the-brights.net/