I think this is relatively contained and has not been seen outside of MTT under normal operating conditions. Also, as Jeff has argued, it doesn't appear to be a regression against 1.3. George and I talked about this, and we agree that we should go ahead and release 1.3.1 as it currently stands. (Two technical side notes follow below the quoted thread.)

--brad
On Wed, Mar 11, 2009 at 7:58 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> So -- Brad/George -- this technically isn't a regression against v1.3.0
> (do we know if this can happen in 1.2?  I don't recall seeing it there,
> but if it's so elusive... I haven't been MTT testing the 1.2 series in a
> long time).  But it is a nonzero problem.
>
> Should we release 1.3.1 without a fix?
>
>
> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>
>> I actually wasn't implying that Eugene's changes -caused- the problem,
>> but rather that I thought they might have -fixed- the problem.
>>
>> :-)
>>
>>
>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>
>>> I forgot to mention that, since I ran into this issue so long ago, I
>>> really doubt that Eugene's SM changes caused it.
>>>
>>> --td
>>>
>>> Terry Dontje wrote:
>>>> Hey!!!  I ran into this problem many months ago, but it's been so
>>>> elusive that I haven't nailed it down.  The first time we saw this
>>>> was last October.  I did some MTT gleaning and could not find anyone
>>>> but Solaris having this issue under MTT.  What's interesting is that
>>>> I gleaned Sun's MTT results and could not find any of these failures
>>>> as far back as last October.
>>>>
>>>> What it looked like to me was that the shared memory segment might
>>>> not have been initialized with 0's, thus allowing one of the
>>>> processes to start following pointers that did not contain valid
>>>> addresses.  However, when I was looking at this I was told the mmap
>>>> file was created with ftruncate, which should essentially zero-fill
>>>> the memory.  So I was at a loss as to why this was happening.
>>>>
>>>> I was able to reproduce this for a little while by manually setting
>>>> up a script that ran a small np=2 program over and over for somewhere
>>>> under 3-4 days.  But around November I was unable to reproduce the
>>>> issue after 4 days of runs, and threw up my hands until I could find
>>>> more failures under MTT -- which, for Sun, I haven't.
>>>>
>>>> Note that I was able to reproduce this issue on both SPARC- and
>>>> Intel-based platforms.
>>>>
>>>> --td
>>>>
>>>> Ralph Castain wrote:
>>>>> Hey Jeff
>>>>>
>>>>> I seem to recall seeing the identical problem reported on the user
>>>>> list not long ago... or it may have been the devel list.  Anyway, it
>>>>> was during btl_sm_add_procs, and the code was segv'ing.
>>>>>
>>>>> I don't have the archives handy here, but perhaps you might search
>>>>> them and see if there is a common theme here.  IIRC, some of
>>>>> Eugene's fixes impacted this problem.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>>>>
>>>>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>
>>>>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>>>>>> MTT. :-(  I can't reproduce them manually, but they seem to only
>>>>>>> happen in a very small fraction of overall MTT runs.  I'm seeing
>>>>>>> at least 3 classes of errors:
>>>>>>>
>>>>>>> 1. btl_sm_add_procs.c:529, which is this:
>>>>>>>
>>>>>>>   if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>>>>>
>>>>>>> j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>>>>> .fifo[3][3] = x+3*offset).
>>>>>>>
>>>>>>> But gdb says:
>>>>>>>
>>>>>>>   (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>>>>   Cannot access memory at address 0x2a96b73050
>>>>>>
>>>>>> Bah -- this is a red herring; this memory is in the shared memory
>>>>>> segment, and that memory is not saved in the corefile.  So of
>>>>>> course gdb can't access it (I just did a short controlled test and
>>>>>> proved this to myself).
>>>>>>
>>>>>> But I don't understand why I would have a bunch of tests that all
>>>>>> segv at btl_sm_add_procs.c:529. :-(
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems
>
> --
> Jeff Squyres
> Cisco Systems
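
A side note on Terry's zero-fill point above: POSIX does specify that the
bytes added when ftruncate() extends a file read back as zero, so an
mmap'ed segment backed by such a file should come up zero-filled.  Here's
a minimal standalone sketch of that assumption -- not the actual sm BTL
setup code, and the file name is made up for illustration:

    /* Check that ftruncate()-extended, mmap()'ed memory reads as zero. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/tmp/sm_zero_test", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Extend the file; POSIX says the new bytes read back as zero. */
        if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }

        char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* Verify the zero-fill assumption byte by byte. */
        for (size_t i = 0; i < len; i++) {
            if (seg[i] != 0) { printf("non-zero byte at %zu\n", i); return 1; }
        }
        printf("all %zu bytes are zero, as expected\n", len);

        munmap(seg, len);
        close(fd);
        unlink("/tmp/sm_zero_test");
        return 0;
    }

If that invariant holds on the affected platforms, the explanation
presumably lies elsewhere than missing zero-fill.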
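
And on Jeff's corefile observation: on Linux, file-backed MAP_SHARED
mappings are typically not written into core files by default (this is
what /proc/<pid>/coredump_filter controls), so gdb reports "Cannot access
memory" for addresses that were perfectly valid when the process died.
A minimal sketch of the kind of controlled test Jeff describes -- my
guess at it, not his actual test, with a made-up file name:

    /* Map a shared, file-backed segment, store into it, then abort().
     * With core dumps enabled (ulimit -c unlimited), "gdb ./a.out core"
     * followed by "print *seg" will usually show "Cannot access memory"
     * because the shared mapping is not saved in the core file. */
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    volatile int *seg;   /* global, so gdb can find it by name */

    int main(void)
    {
        int fd = open("/tmp/sm_core_test", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;

        seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) return 1;

        *seg = 42;   /* the value lives only in the shared mapping */
        abort();     /* dump core so the mapping can be inspected */
    }

So "Cannot access memory" on a shared-memory address in a corefile tells
us nothing about whether the address was valid -- consistent with Jeff's
conclusion that the gdb output is a red herring.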