What error do you get when the compilation fails?

Lenny.

On 3/23/09, Timothy Hayes <haye...@tcd.ie> wrote:
>
> That's a relief to know, although I'm still a bit concerned. I'm looking at
> the code for the OpenMPI 1.3 trunk and in the ob1 component I can see the
> following sequence:
>
> mca_pml_ob1_recv_frag_callback_match -> append_frag_to_list ->
> MCA_PML_OB1_RECV_FRAG_ALLOC -> OMPI_FREE_LIST_WAIT -> __ompi_free_list_wait
>
> so I'm guessing that unless the deadlock issue has been resolved for that
> function, it will still fail non-deterministically. I'm quite eager to give
> it a try, but my component doesn't compile as is with the 1.3 source. Is it
> trivial to convert it?
>
> Or maybe you were suggesting that I go into the code of ob1 myself and
> manually change every _wait to _get?
>
> Kind regards
> Tim
>
> 2009/3/23 George Bosilca <bosi...@eecs.utk.edu>
>
>> It is a known problem. When the freelist is empty, going into
>> ompi_free_list_wait will block the process until at least one fragment
>> becomes available. As a fragment can become available only when returned by
>> the BTL, this can lead to deadlocks in some cases. The workaround is to ban
>> the usage of the blocking _wait function, and replace it with the
>> non-blocking version _get. The PML has all the required logic to deal with
>> the cases where a fragment cannot be allocated. We changed most of the BTLs
>> to use _get instead of _wait a few months ago.
>>
>> Thanks,
>> george.
>>
>> On Mar 23, 2009, at 11:58 , Timothy Hayes wrote:
>>
>> Hello,
>>>
>>> I'm working on an OpenMPI BTL component and am having a recurring
>>> problem; I was wondering if anyone could shed some light on it. I have a
>>> component that's quite straightforward: it uses a pair of lightweight
>>> sockets to take advantage of being in a virtualised environment
>>> (specifically Xen). My code is a bit messy and has lots of inefficiencies,
>>> but the logic seems sound enough. I've been able to execute a few simple
>>> programs successfully using the component, and they work most of the time.
>>>
>>> The problem I'm having is actually happening in higher layers,
>>> specifically in my asynchronous receive handler, when I call the callback
>>> function (cbfunc) that was set by the PML in the BTL initialisation phase.
>>> It seems to be getting stuck in an infinite loop at __ompi_free_list_wait().
>>> In this function there is a condition variable which should get set
>>> eventually but just doesn't. I've stepped through it with GDB and I get a
>>> backtrace of something like this:
>>>
>>> mca_btl_xen_endpoint_recv_handler -> mca_btl_xen_endpoint_start_recv ->
>>> mca_pml_ob1_recv_frag_callback -> mca_pml_ob1_recv_frag_match ->
>>> __ompi_free_list_wait -> opal_condition_wait
>>>
>>> and from there it just loops. Although this is happening in higher
>>> levels, I haven't noticed something like this happening in any of the other
>>> BTL components, so chances are there's something in my code that's causing
>>> this. I very much doubt that it's actually waiting for a list item to be
>>> returned, since this infinite loop can occur non-deterministically and
>>> sometimes even on the first receive callback.
>>>
>>> I'm really not too sure what else to include with this e-mail. I could
>>> send my source code (a bit nasty right now) if it would be helpful, but I'm
>>> hoping that someone might have noticed this problem before or something
>>> similar. Maybe I'm making a common mistake. Any advice would be really
>>> appreciated!
>>>
>>> I'm using OpenMPI 1.2.9 from the SVN tag repository.
>>>
>>> Kind regards
>>> Tim Hayes
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
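
The _wait versus _get distinction that George describes in the thread can be
illustrated with a small, self-contained sketch. This is not Open MPI code:
toy_free_list_t, toy_list_get, toy_list_return and toy_recv_callback are
hypothetical names invented for the illustration, and the real
ompi_free_list_t implementation is more involved. The sketch only shows the
shape of the workaround, assuming a fixed-size pool: a non-blocking
allocation returns NULL when the pool is empty, and the receive callback
reports that it could not allocate instead of waiting for a fragment that
only the same call path could ever give back.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy free list; NOT the Open MPI ompi_free_list_t API. */
#define TOY_CAPACITY 2

typedef struct {
    int slots[TOY_CAPACITY];        /* backing storage for the fragments   */
    int *free_items[TOY_CAPACITY];  /* stack of fragments currently free   */
    size_t num_free;                /* number of entries in free_items     */
} toy_free_list_t;

/* Mark every fragment in the pool as free. */
static void toy_list_init(toy_free_list_t *fl)
{
    for (size_t i = 0; i < TOY_CAPACITY; i++) {
        fl->slots[i] = 0;
        fl->free_items[i] = &fl->slots[i];
    }
    fl->num_free = TOY_CAPACITY;
}

/* Non-blocking allocation, in the spirit of _get:
 * returns NULL immediately when the list is empty instead of waiting. */
static int *toy_list_get(toy_free_list_t *fl)
{
    if (fl->num_free == 0) {
        return NULL;
    }
    return fl->free_items[--fl->num_free];
}

/* Give a fragment back to the pool. */
static void toy_list_return(toy_free_list_t *fl, int *item)
{
    assert(fl->num_free < TOY_CAPACITY);
    fl->free_items[fl->num_free++] = item;
}

/* Hypothetical receive callback. Returns 1 and stores the fragment in *out
 * if one could be allocated; returns 0 and sets *out to NULL otherwise, so
 * the caller can retry later. A blocking wait at this point could never
 * finish if the only code able to return fragments is the caller itself. */
static int toy_recv_callback(toy_free_list_t *fl, int payload, int **out)
{
    int *frag = toy_list_get(fl);
    if (frag == NULL) {
        *out = NULL;
        return 0;
    }
    *frag = payload;
    *out = frag;
    return 1;
}
```

With a pool of two fragments, a third call to toy_recv_callback returns 0
rather than hanging, and it succeeds again once toy_list_return has given a
fragment back. How the real PML queues and retries such a request is not
shown here.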