To that point, where exactly in the openib BTL init / query sequence is it returning an error for you, Sylvain? Is it just a matter of tidying something up properly before returning the error?
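As an aside on what "tidying something up properly before returning the error" could look like in practice, here is a minimal sketch of a query/init routine that releases everything it allocated before reporting failure. This is not the actual openib BTL code; the names btl_state_t, btl_component_query, BTL_SUCCESS, and BTL_ERROR are made up for illustration only.

    /* Illustrative sketch only -- not the real openib BTL component code. */
    #include <stdlib.h>

    #define BTL_SUCCESS  0
    #define BTL_ERROR   -1

    typedef struct {
        void *device_ctx;     /* stand-in for an opened device/HCA context */
        void *endpoint_table; /* stand-in for per-peer bookkeeping */
    } btl_state_t;

    static int btl_component_query(btl_state_t **out)
    {
        btl_state_t *state = calloc(1, sizeof(*state));
        if (NULL == state) {
            return BTL_ERROR;
        }

        state->device_ctx = malloc(64);          /* pretend to open the device */
        if (NULL == state->device_ctx) {
            free(state);                         /* tidy up before returning */
            return BTL_ERROR;
        }

        state->endpoint_table = malloc(256);     /* pretend to set up endpoints */
        if (NULL == state->endpoint_table) {
            free(state->device_ctx);             /* release earlier allocations too */
            free(state);
            return BTL_ERROR;
        }

        *out = state;
        return BTL_SUCCESS;
    }

The point is simply that the error path leaves no half-initialized state behind, so the upper layers can safely fall back to other transports.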
On May 28, 2010, at 2:21 PM, George Bosilca wrote:

> On May 28, 2010, at 10:03 , Sylvain Jeaugey wrote:
>
> > On Fri, 28 May 2010, Jeff Squyres wrote:
> >
> >> On May 28, 2010, at 9:32 AM, Jeff Squyres wrote:
> >>
> >>> Understood, and I agreed that the bug should be fixed. Patches would be
> >>> welcome. :-)
>
> > I sent a patch on the bml layer in my first e-mail. We will apply it on our
> > tree, but as always we're trying to send patches back to open-source (that
> > was not my intent to start such a debate).
>
> The only problem with your patch is that it solves something that is not
> supposed to happen. As a proof of concept I did return errors from the tcp
> and sm BTLs, and Open MPI gracefully dealt with them. So, it is not a matter
> of aborting we're looking at; it is a matter of the openib BTL doing something
> it is not supposed to do.
>
> Going through the code, it looks like the bitmask doesn't matter: if an error
> is returned by a BTL, we zero the bitmask and continue to another BTL.
>
> Example: the SM BTL returns OMPI_ERROR after creating all the internal
> structures.
>
> >> mpirun -np 4 --host node01 --mca btl sm,self ./ring
>
> --------------------------------------------------------------------------
> At least one pair of MPI processes are unable to reach each other for
> MPI communications. This means that no Open MPI device has indicated
> that it can be used to communicate between these processes. This is
> an error; Open MPI requires that all MPI processes be able to reach
> each other. This error can sometimes be the result of forgetting to
> specify the "self" BTL.
>
> Process 1 ([[22047,1],3]) is on host: node01
> Process 2 ([[22047,1],0]) is on host: node01
> BTLs attempted: self sm
>
> Your MPI job is now going to abort; sorry.
> --------------------------------------------------------------------------
>
> Now if I allow TCP on the node:
>
> >> mpirun -np 4 --host node01 --mca btl sm,self,tcp ./ring
>
> Process 0 sending 10 to 1, tag 201 (4 procs in ring)
> Process 0 sent to 1
> Process 3 exiting
> Process 0 decremented num: 9
> Process 0 decremented num: 8
>
> Thus, Open MPI does the right thing when the BTLs are playing the game.
>
>   george.
>
> >> I should clarify rather than being flip:
> >>
> >> 1. I agree: the bug should be fixed. Clearly, we should never crash.
> >>
> >> 2. After the bug is fixed, there is clearly a choice: some people may want
> >> to use a different transport if a given BTL is unavailable. Others may
> >> want to abort. Once the bug is fixed, this seems like a pretty
> >> straightforward thing to add.
>
> > If you use my patch, you still have no choice. Errors on BTLs lead to an
> > immediate stop instead of trying to continue (and crash).
> >
> > If someone wants to go further on this, then that's great. If nobody does,
> > I think you should take my patch. Maybe it's not the best solution, but
> > it's still better than the current state.
> >
> > Sylvain

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/
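For readers following the thread: George's description of the selection logic (on a BTL error, zero that BTL's reachability bits and try the next one) can be sketched roughly as below. The names btl_module_t, add_procs, and select_btls are simplified stand-ins, not the real Open MPI BML structures or functions.

    /* Rough sketch of the fallback behavior described above -- not real Open MPI code. */
    #include <stdint.h>

    typedef struct btl_module {
        const char *name;
        /* returns 0 on success and sets bits in *reachable for the peers it can reach */
        int (*add_procs)(struct btl_module *btl, uint64_t *reachable);
    } btl_module_t;

    uint64_t select_btls(btl_module_t **btls, int nbtls)
    {
        uint64_t all_reachable = 0;

        for (int i = 0; i < nbtls; ++i) {
            uint64_t reachable = 0;
            if (0 != btls[i]->add_procs(btls[i], &reachable)) {
                /* BTL reported an error: zero its bitmask and move on to the next BTL */
                reachable = 0;
                continue;
            }
            all_reachable |= reachable;
        }

        /* if no BTL claims a given peer, the job aborts with the
           "unable to reach each other" help message quoted above */
        return all_reachable;
    }

This fallback only works if a failing BTL returns cleanly, which is exactly the behavior the openib BTL is being asked to provide.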