Is it time to "svn rm ompi/mca/pml/dr"?
On Aug 4, 2009, at 6:50 AM, Ralph Castain wrote:
Rolf/Mouhamed

Could you get together off-list to discuss the different approaches and
see if/where there is common ground.

It would be nice to see an integrated solution - personally, I would
rather not see two orthogonal approaches unless they can be cleanly
separated. Much better if they could support each other in an
intelligent fashion.

On Aug 3, 2009, at 9:49 AM, Pavel Shamis (Pasha) wrote:

>
>> I have not, but there should be no difference. The failover code
>> only gets triggered when an error happens. Otherwise, there are no
>> differences in the code paths while everything is functioning
>> normally.
> Sounds good. I still did not have time to review the code. I will
> try to do it during this week.
>
> Pasha
>>
>> Rolf
>>
>> On 08/03/09 11:14, Pavel Shamis (Pasha) wrote:
>>> Rolf,
>>> Did you compare latency/bw for failover-enabled code VS trunk ?
>>>
>>> Pasha.
>>>
>>> Rolf Vandevaart wrote:
>>>> Hi folks:
>>>>
>>>> As some of you know, I have also been looking into implementing
>>>> failover as well. I took a different approach as I am solving
>>>> the problem within the openib BTL itself. This of course means
>>>> that this only works for failing from one openib BTL to another
>>>> but that was our area of interest. This also means that we do
>>>> not need to keep track of fragments as we get them back from the
>>>> completion queue upon failure. We then extract the relevant
>>>> information and repost on the other working endpoint.
>>>>
>>>> My work has been progressing at http://bitbucket.org/rolfv/ompi-failover
>>>> .
>>>>
>>>> This only currently works for send semantics so you have to run
>>>> with -mca btl_openib_flags 1.
>>>>
>>>> Rolf
>>>>
>>>> On 07/31/09 05:49, Mouhamed Gueye wrote:
>>>>> Hi list,
>>>>>
>>>>> Here is an update on our work concerning device failover.
>>>>>
>>>>> As many of you suggested, we reoriented our work on ob1 rather
>>>>> than dr and we now have a working prototype on top of ob1. The
>>>>> approach is to store btl descriptors sent to peers and delete
>>>>> them when we receive proof of delivery. So far, we rely on
>>>>> completion callback functions, assuming that the message is
>>>>> delivered when the completion function is called, that is the
>>>>> case of openib. When a btl module fails, it is removed from the
>>>>> endpoint's btl list and the next one is used to retransmit
>>>>> stored descriptors. No extra-message is transmitted, it only
>>>>> consists in additions to the header. It has been mainly tested
>>>>> with two IB modules, in both multi-rail (two separate networks)
>>>>> and multi-path (a big unique network).
>>>>>
>>>>> You can grab and test the patch here (applies on top of the
>>>>> trunk) :
>>>>> http://bitbucket.org/gueyem/ob1-failover/
>>>>>
>>>>> To compile with failover support, just define
>>>>> --enable-device-failover at configure. You can then run a
>>>>> benchmark, disconnect a port and see the failover operate.
>>>>>
>>>>> A little latency increase (~ 2%) is induced by the failover
>>>>> layer when no failover occurs. To accelerate the failover
>>>>> process on openib, you can try to lower the
>>>>> btl_openib_ib_timeout openib parameter to 15 for example instead
>>>>> of 20 (default value).
>>>>>
>>>>> Mouhamed
>>>>> _______________________________________________
>>>>> devel mailing list
>>>>> de...@open-mpi.org
>>>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
jsquy...@cisco.com