I found another problem with the libevent branch.
If I set "-mca btl tcp,self" on the command line, I get a segfault
when sending messages > 16 KB. I can try to make a smaller reproducer,
but in the meantime you can reproduce it with the "progress" or
"simple" tests in ompi-tests:
https://svn.open-mpi.org/svn/ompi-tests/trunk/iu/ft/correctness
To build:
shell$ make
To run with failure:
shell$ mpirun -np 2 -mca btl tcp,self progress -s 16 -v 1
To run without failure:
shell$ mpirun -np 2 -mca btl tcp,self progress -s 15 -v 1
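To try other sizes around the threshold, the two command lines above can be generated for any -s value. This is only a sketch: the helper names are hypothetical, the mpirun flags are copied verbatim from the commands above, and run_once assumes mpirun and the progress binary can be found from the current directory.

```python
import subprocess

def build_command(size_arg, btls="tcp,self"):
    """Return the mpirun argv used above, with a different -s value."""
    return ["mpirun", "-np", "2", "-mca", "btl", btls,
            "progress", "-s", str(size_arg), "-v", "1"]

def run_once(size_arg):
    """Run one size and return the mpirun exit code (not called here)."""
    return subprocess.run(build_command(size_arg)).returncode

# Print the commands for the passing and failing sizes reported above.
for size_arg in (15, 16):
    print(" ".join(build_command(size_arg)))
```

run_once blocks until mpirun exits, so it is left uncalled in the sketch.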
This program will display the message "Checkpoint at any time...".
If you send mpirun SIGUSR2, it will progress to the next stage of the
test. The failure occurs on the first message, though, before this
becomes an issue.
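For completeness, sending the signal can be scripted as well. A minimal sketch, assuming you already know the process ID of mpirun (for example from ps); the function name is hypothetical:

```python
import os
import signal

def advance_test(mpirun_pid):
    """Send SIGUSR2 to the given mpirun process ID."""
    os.kill(mpirun_pid, signal.SIGUSR2)
```

Call advance_test with the mpirun process ID once the "Checkpoint at any time..." message has appeared.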
I was using Odin. If I do not specify the btls, the test passes as
normal.
The backtrace is below:
------------------------------------------
...
Core was generated by `progress -s 16 -v 1'.
Program terminated with signal 11, Segmentation fault.
#0  0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
267         bml_btl->btl_free( bml_btl->btl, des );
(gdb) bt
#0  0x0000002a9793318b in mca_bml_base_free (bml_btl=0x736275705f61636d, des=0x559700) at ../../../../ompi/mca/bml/bml.h:267
#1  0x0000002a9793304d in mca_pml_ob1_put_completion (btl=0x5598c0, ep=0x0, des=0x559700, status=0) at pml_ob1_recvreq.c:190
#2  0x0000002a97930069 in mca_pml_ob1_recv_frag_callback (btl=0x5598c0, tag=64 '@', des=0x2a989d2b00, cbdata=0x0) at pml_ob1_recvfrag.c:149
#3  0x0000002a97d5f3e0 in mca_btl_tcp_endpoint_recv_handler (sd=10, flags=2, user=0x5a5df0) at btl_tcp_endpoint.c:696
#4  0x0000002a95a0ab93 in event_process_active (base=0x508c80) at event.c:591
#5  0x0000002a95a0af59 in opal_event_base_loop (base=0x508c80, flags=2) at event.c:763
#6  0x0000002a95a0ad2b in opal_event_loop (flags=2) at event.c:670
#7  0x0000002a959fadf8 in opal_progress () at runtime/opal_progress.c:169
#8  0x0000002a9792caae in opal_condition_wait (c=0x2a9587d940, m=0x2a9587d9c0) at ../../../../opal/threads/condition.h:93
#9  0x0000002a9792c9dd in ompi_request_wait_completion (req=0x5a5380) at ../../../../ompi/request/request.h:381
#10 0x0000002a9792c920 in mca_pml_ob1_recv (addr=0x5baf70, count=16384, datatype=0x503770, src=1, tag=1001, comm=0x5039a0, status=0x0) at pml_ob1_irecv.c:104
#11 0x0000002a956f1f00 in PMPI_Recv (buf=0x5baf70, count=16384, type=0x503770, source=1, tag=1001, comm=0x5039a0, status=0x0) at precv.c:75
#12 0x000000000040211f in exchange_stage1 (ckpt_num=1) at progress.c:414
#13 0x0000000000401295 in main (argc=5, argv=0x7fbfffe668) at progress.c:131
(gdb) p bml_btl
$1 = (mca_bml_base_btl_t *) 0x736275705f61636d
(gdb) p *bml_btl
Cannot access memory at address 0x736275705f61636d
------------------------------------------
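One observation on the bad bml_btl pointer: every byte of 0x736275705f61636d falls in the printable ASCII range, which may mean the memory holding it was overwritten with string data rather than holding a stale address. A minimal sketch to decode it, assuming a little-endian 64-bit host (the helper name is hypothetical):

```python
import struct

def pointer_as_ascii(value):
    """Show the 8 bytes of a 64-bit value, in memory order, as ASCII."""
    raw = struct.pack("<Q", value)  # "<Q": little-endian unsigned 64-bit
    return "".join(chr(b) if 32 <= b < 127 else "." for b in raw)

# Pointer value taken from frame #0 of the backtrace above.
print(pointer_as_ascii(0x736275705f61636d))  # prints "mca_pubs"
```

Under that assumption the value reads as "mca_pubs", which looks like the start of an MCA-style name. That is a hint about where to look, not a confirmed diagnosis.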
-- Josh
On Mar 17, 2008, at 2:50 PM, Jeff Squyres wrote:
WHAT: Bring new version of libevent to the trunk.
WHY: Newer version, slightly better performance (lower overheads /
lighter weight), properly integrate the use of epoll and other
scalable fd monitoring mechanisms.
WHERE: 98% of the changes are in opal/event; there are a few changes
to configury and one change to the orted.
TIMEOUT: COB, Friday, 21 March 2008
DESCRIPTION:
George/UTK has done the bulk of the work to integrate a new version
of libevent on the following tmp branch:
https://svn.open-mpi.org/svn/ompi/tmp-public/libevent-merge
** WE WOULD VERY MUCH APPRECIATE IF PEOPLE COULD MTT TEST THIS BRANCH! **
Cisco ran MTT on this branch on Friday and everything checked out
(i.e., no more failures than on the trunk). We just made a few more
minor changes today and I'm running MTT again now, but I'm not
expecting any new failures (MTT will take several hours). We would
like to bring the new libevent in over this upcoming weekend, but
would very much appreciate if others could test on their platforms
(Cisco tests mainly 64 bit RHEL4U4). This new libevent *should* be a
fairly side-effect free change, but it is possible that since we're
now using epoll and other scalable fd monitoring tools, we'll run into
some unanticipated issues on some platforms.
Here's a consolidated diff if you want to see the changes:
https://svn.open-mpi.org/trac/ompi/changeset?old_path=tmp-public%2Flibevent-merge&old=17846&new_path=trunk&new=17842
Thanks.
--
Jeff Squyres
Cisco Systems
_______________________________________________
devel mailing list
de...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/devel