On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote:
> As Terry stated, I think this bugger is quite rare. I'm having a helluva
> time trying to reproduce it manually (over 5k runs this morning and still
> no segv). Ugh.
5k of which test(s)? Can this error happen on any test? I am wondering if
we could narrow down to a smaller subset of the nightly tests to reproduce
this (the way Terry did by looping over the same test(s) for a looong
time). I see the following over the past 30 days:

# | Date range | Org | Hostname | Platform name | Hardware | OS | MPI name | MPI version | Suite | Test | np | Stdout | Pass | Fail | Skip | Timed | Perf
1 | 2009-02-12 06:47:56 | sun | burl-ct-v440-2 | burl-ct-v440-2 | sun4u | SunOS | ompi-nightly-v1.3 | 1.3.1a0r20508 | ibm-64 | cxx_create_disp | 8 | btl_sm_add_procs | 0 | 1 | 0 | 0 | 0
2 | 2009-02-27 23:37:02 | sun | burl-ct-v20z-2 | burl-ct-v20z-2 | i86pc | SunOS | ompi-nightly-v1.3 | 1.3.1rc1r20628 | ibm-64 | lbub | 4 | btl_sm_add_procs | 0 | 1 | 0 | 0 | 0
3 | 2009-03-05 00:15:39 | sun | burl-ct-v20z-2 | burl-ct-v20z-2 | i86pc | SunOS | ompi-nightly-v1.3 | 1.3.1rc3r20684 | ibm-32 | loop | 4 | btl_sm_add_procs | 0 | 1 | 0 | 0 | 0
4 | 2009-03-05 22:31:43 | sun | burl-ct-v20z-2 | burl-ct-v20z-2 | i86pc | SunOS | ompi-nightly-v1.3 | 1.3.1rc4r20704 | intel-64 | MPI_Type_size_MPI_LB_UB_c | 4 | btl_sm_add_procs | 0 | 1 | 0 | 0 | 0
5 | 2009-03-10 14:47:36 | cisco | svbu-mpi[035-036] | svbu-mpi | x86_64 | Linux | ompi-nightly-v1.3 | 1.3.1rc5r20730 | intel | MPI_Test_cancelled_false_c | 8 | btl_sm_add_procs | 0 | 1 | 0 | 0 | 0

What do these tests have in common?

  ./intel_tests/src/MPI_Test_cancelled_false_c.c
  ./intel_tests/src/MPI_Type_size_MPI_LB_UB_c.c
  ./ibm/onesided/cxx_create_disp.cc
  ./ibm/datatype/lbub2.c
  ./ibm/datatype/loop.c
  ./ibm/datatype/lbub.c

It almost looks like the problem is more likely to occur if MPI_UB or
MPI_LB is involved, or am I just imagining things?
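In case it helps anyone loop on a reproducer, here is a minimal sketch along
the lines of those datatype tests. It is not taken from any of the actual
test sources (and since the segv is in the MPI_Init/add_procs path, the
datatype content may well be irrelevant); it just mirrors the MPI_LB/MPI_UB
shape the tests share:

    /* lbub_sketch.c -- hypothetical reproducer sketch, NOT one of the actual
     * intel/ibm test sources: build a type with MPI_LB/MPI_UB markers and
     * query it, so it can be looped under mpirun -np 4 over the sm BTL. */
    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, size, tsize;
        MPI_Aint lb, extent;
        MPI_Datatype newtype;
        int blocklens[3] = { 1, 1, 1 };
        MPI_Aint disps[3] = { -3, 0, 6 };
        MPI_Datatype types[3] = { MPI_LB, MPI_INT, MPI_UB };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* MPI-1 style constructor, as the 2009-era tests use */
        MPI_Type_struct(3, blocklens, disps, types, &newtype);
        MPI_Type_commit(&newtype);

        MPI_Type_size(newtype, &tsize);
        MPI_Type_lb(newtype, &lb);
        MPI_Type_extent(newtype, &extent);

        if (rank == 0) {
            printf("np=%d size=%d lb=%ld extent=%ld\n",
                   size, tsize, (long) lb, (long) extent);
        }

        MPI_Type_free(&newtype);
        MPI_Finalize();
        return 0;
    }

Looping something like that (or any of the tests above) with np=4 for a few
hours would at least tell us whether the datatype angle matters or whether
it is just noise.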
-Ethan

>
> Looking through the sm startup code, I can't see exactly what the problem
> would be. :-(
>
> On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:
>
>> I'll run some tests with 1.3.1 on one of our systems and see if it
>> shows up there. If it is truly rare and was in 1.3.0, then I
>> personally don't have a problem with it. Got bigger problems with
>> hanging collectives, frankly - and we don't know how the sm changes
>> will affect this problem, if at all.
>>
>> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>>
>> > Jeff Squyres wrote:
>> >> So -- Brad/George -- this technically isn't a regression against
>> >> v1.3.0 (do we know if this can happen in 1.2? I don't recall
>> >> seeing it there, but if it's so elusive... I haven't been MTT
>> >> testing the 1.2 series in a long time). But it is a nonzero problem.
>> >>
>> > I have not seen 1.2 fail with this problem but I honestly don't know
>> > if that is a fluke or not.
>> >
>> > --td
>> >
>> >> Should we release 1.3.1 without a fix?
>> >>
>> >> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>> >>
>> >>> I actually wasn't implying that Eugene's changes -caused- the problem,
>> >>> but rather that I thought they might have -fixed- the problem.
>> >>>
>> >>> :-)
>> >>>
>> >>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>> >>>
>> >>> > I forgot to mention that since I ran into this issue so long ago I
>> >>> > really doubt that Eugene's SM changes have caused this issue.
>> >>> >
>> >>> > --td
>> >>> >
>> >>> > Terry Dontje wrote:
>> >>> >> Hey!!! I ran into this problem many months ago, but it's been so
>> >>> >> elusive that I haven't nailed it down. First time we saw this
>> >>> >> was last October. I did some MTT gleaning and could not find
>> >>> >> anyone but Solaris having this issue under MTT. What's interesting
>> >>> >> is I gleaned Sun's MTT results and could not find any of these
>> >>> >> failures as far back as last October.
>> >>> >>
>> >>> >> What it looked like to me was that the shared memory segment might
>> >>> >> not have been initialized with 0's, thus allowing one of the
>> >>> >> processes to start accessing locations that did not contain an
>> >>> >> appropriate address. However, when I was looking at this I was
>> >>> >> told the mmap file was created with ftruncate, which essentially
>> >>> >> should 0-fill the memory. So I was at a loss as to why this was
>> >>> >> happening.
>> >>> >>
>> >>> >> I was able to reproduce this for a little while by manually setting
>> >>> >> up a script that ran a small np=2 program over and over for
>> >>> >> somewhere under 3-4 days. But around November I was unable to
>> >>> >> reproduce the issue after 4 days of runs and threw up my hands
>> >>> >> until I could find more failures under MTT, which for Sun I haven't.
>> >>> >>
>> >>> >> Note that I was able to reproduce this issue on both SPARC- and
>> >>> >> Intel-based platforms.
>> >>> >>
>> >>> >> --td
>> >>> >>
>> >>> >> Ralph Castain wrote:
>> >>> >>> Hey Jeff
>> >>> >>>
>> >>> >>> I seem to recall seeing the identical problem reported on the user
>> >>> >>> list not long ago... or it may have been the devel list. Anyway,
>> >>> >>> it was during btl_sm_add_procs, and the code was segv'ing.
>> >>> >>>
>> >>> >>> I don't have the archives handy here, but perhaps you might search
>> >>> >>> them and see if there is a common theme here. IIRC, some of
>> >>> >>> Eugene's fixes impacted this problem.
>> >>> >>>
>> >>> >>> Ralph
>> >>> >>>
>> >>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>> >>> >>>
>> >>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>> >>> >>>>
>> >>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>> >>> >>>>> MTT. :-( I can't reproduce them manually, but they seem to only
>> >>> >>>>> happen in a very small fraction of overall MTT runs. I'm seeing
>> >>> >>>>> at least 3 classes of errors:
>> >>> >>>>>
>> >>> >>>>> 1. btl_sm_add_procs.c:529, which is this:
>> >>> >>>>>
>> >>> >>>>>     if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>> >>> >>>>>
>> >>> >>>>> j = 3, my_smp_rank = 1. mca_btl_sm_component.fifo[j][my_smp_rank]
>> >>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>> >>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>> >>> >>>>> .fifo[3][3] = x+3*offset). But gdb says:
>> >>> >>>>>
>> >>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>> >>> >>>>> Cannot access memory at address 0x2a96b73050
>> >>> >>>>>
>> >>> >>>> Bah -- this is a red herring; this memory is in the shared memory
>> >>> >>>> segment, and that memory is not saved in the corefile. So of
>> >>> >>>> course gdb can't access it (I just did a short controlled test
>> >>> >>>> and proved this to myself).
>> >>> >>>>
>> >>> >>>> But I don't understand why I would have a bunch of tests that all
>> >>> >>>> segv at btl_sm_add_procs.c:529. :-(
>> >>> >>>>
>> >>> >>>> --
>> >>> >>>> Jeff Squyres
>> >>> >>>> Cisco Systems
>
> --
> Jeff Squyres
> Cisco Systems
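One aside on Terry's ftruncate point above: plain POSIX does promise that
the area added by ftruncate reads back as zeros, and so does a MAP_SHARED
mapping of it. Here is a minimal standalone sanity check of that pattern
(my own sketch, not the actual sm BTL mmap code), which is why the "segment
not initialized with 0's" theory is hard to square with the implementation:

    /* ftruncate_zero_sketch.c -- standalone sanity check (not Open MPI code):
     * an ftruncate-extended, MAP_SHARED mmap'ed file should read back as all
     * zeros, per POSIX. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/types.h>

    int main(void)
    {
        const size_t len = 1 << 20;   /* 1 MB, stand-in for a small sm backing file */
        const char *path = "/tmp/sm_zero_check";
        char *base;
        size_t i;
        int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);

        if (fd < 0) { perror("open"); return 1; }

        /* POSIX: the region added by ftruncate appears as zero-filled bytes */
        if (ftruncate(fd, (off_t) len) != 0) { perror("ftruncate"); return 1; }

        base = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (base == MAP_FAILED) { perror("mmap"); return 1; }

        for (i = 0; i < len; ++i) {
            if (base[i] != 0) {
                fprintf(stderr, "non-zero byte at offset %lu\n", (unsigned long) i);
                return 1;
            }
        }
        printf("all %lu mapped bytes read back as zero\n", (unsigned long) len);

        munmap(base, len);
        close(fd);
        unlink(path);
        return 0;
    }

If that check ever failed on one of the machines above, that would be worth
knowing on its own.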
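And one aside on Jeff's note that the shared memory segment is not saved in
the corefile: on Linux at least (not Solaris), whether file-backed shared
mappings end up in a core dump is governed by /proc/<pid>/coredump_filter
(see core(5)), and they are typically excluded by default. A hedged,
Linux-only sketch of widening that from inside the test program, assuming a
kernel new enough to have coredump_filter:

    /* coredump_filter_sketch.c -- hypothetical helper (Linux-only): ask the
     * kernel to also dump file-backed mappings, so a core from a segv in
     * btl_sm_add_procs would include the mmap'ed sm segment and gdb could
     * inspect the fifo structures post-mortem. */
    #include <stdio.h>

    static void include_file_backed_mappings_in_core(void)
    {
        FILE *f = fopen("/proc/self/coredump_filter", "w");
        if (f == NULL) {
            return;   /* older kernel or non-Linux box: leave the default alone */
        }
        /* the default is typically 0x33; adding bits 2 and 3 covers file-backed
         * private and shared mappings -- see core(5) */
        fputs("0x3f", f);
        fclose(f);
    }

    int main(void)
    {
        include_file_backed_mappings_in_core();
        /* ... MPI_Init() and the rest of the test would go here ... */
        return 0;
    }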