Commit 17872 is the one you're looking for.

https://svn.open-mpi.org/trac/ompi/changeset/17872

george.

On Mar 18, 2008, at 9:12 PM, Jeff Squyres wrote:

When did you fix it?  I merged the trunk down to the libevent-merge
branch late this afternoon (r17869).


On Mar 18, 2008, at 7:29 PM, George Bosilca wrote:

This has been fixed in the trunk, but not yet merged in the branch.

george.

On Mar 18, 2008, at 7:17 PM, Josh Hursey wrote:

I found another problem with the libevent branch.

If I set "-mca btl tcp,self" on the command line then I get a segfault
when sending messages > 16 KB. I can try to make a smaller reproducer,
but you can see it with the "progress" or "simple" tests in ompi-tests:
https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness

To build:
shell$ make
To run with failure:
shell$ mpirun  -np 2 -mca btl tcp,self progress  -s 16 -v 1
To run without failure:
shell$ mpirun  -np 2 -mca btl tcp,self progress  -s 15 -v 1

This program will display the message "Checkpoint at any time...". If
you send mpirun SIGUSR2 it will progress to the next stage of the
test. The failure occurs on the first message, though, before this
becomes an issue.

I was using Odin, and if I do not specify the btls then the test will
pass as normal.

The backtrace is below:
------------------------------------------
...
Core was generated by `progress -s 16 -v 1'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
267         bml_btl->btl_free( bml_btl->btl, des );
(gdb) bt
#0  0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
#1  0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
#2  0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
#3  0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
#4  0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
#5  0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
#6  0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
#7  0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
#8  0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
#9  0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
#10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
#11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
#12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
#13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
(gdb) p bml_btl
$1 = (mca_bml_base_btl_t *) 0x736275705f61636d
(gdb) p *bml_btl
Cannot access memory at address 0x736275705f61636d
------------------------------------------
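
One observation on the bad pointer in the backtrace: the value
0x736275705f61636d consists entirely of printable ASCII bytes, which
often means string data overwrote the pointer. Below is a minimal
sketch for decoding it. It assumes a little-endian host (the
architecture of Odin is not stated in this thread), and the helper
name pointer_as_ascii is purely illustrative, not part of Open MPI.

```python
def pointer_as_ascii(value: int, width: int = 8, byteorder: str = "little") -> str:
    """Render an integer pointer value as the ASCII text it would spell in memory.

    byteorder="little" is an assumption about the host; pass "big" otherwise.
    Non-printable bytes are shown as '.'.
    """
    raw = value.to_bytes(width, byteorder)
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)


if __name__ == "__main__":
    # The bml_btl value reported by gdb in frame #0.
    print(pointer_as_ascii(0x736275705F61636D))
```

On a little-endian machine this prints "mca_pubs", which looks like the
start of an MCA-style name. That is only a hint that a string may have
been written over the descriptor, not a diagnosis.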

-- Josh

On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:

WHAT: Bring new version of libevent to the trunk.

WHY: Newer version, slightly better performance (lower overheads /
lighter weight), properly integrate the use of epoll and other
scalable fd monitoring mechanisms.

WHERE: 98% of the changes are in opal/event; there are a few changes
to configury and one change to the orted.

TIMEOUT: COB, Friday, 21 March 2008

DESCRIPTION:

George/UTK has done the bulk of the work to integrate a new version
of libevent on the following tmp branch:

  https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge

** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **

Cisco ran MTT on this branch on Friday and everything checked out
(i.e., no more failures than on the trunk). We just made a few more
minor changes today and I'm running MTT again now, but I'm not
expecting any new failures (MTT will take several hours).  We would
like to bring the new libevent in over this upcoming weekend, but
would very much appreciate if others could test on their platforms
(Cisco tests mainly 64 bit RHEL4U4).  This new libevent *should* be a
fairly side-effect free change, but it is possible that since we're
now using epoll and other scalable fd monitoring tools, we'll run into
some unanticipated issues on some platforms.

Here's a consolidated diff if you want to see the changes:

https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842

Thanks.

--
Jeff Squyres
Cisco Systems

_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel



--
Jeff Squyres
Cisco Systems
