On Wed, Mar/11/2009 11:38:19AM, Jeff Squyres wrote:
> As Terry stated, I think this bugger is quite rare.  I'm having a helluva 
> time trying to reproduce it manually (over 5k runs this morning and still 
> no segv).  Ugh.

5k of which test(s)? Can this error happen on any test? I am wondering
if we could narrow this down to a smaller subset of the nightly tests to
reproduce it (the way Terry did by looping over the same test(s) for
a looong time). I see the following over the past 30 days:

  # | Date range          | Org   | Hostname          | Platform name  | Hardware | OS    | MPI name          | MPI version    | Suite    | Test                       | np | Stdout           | Pass | Fail | Skip | Timed | Perf
  1 | 2009-02-12 06:47:56 | sun   | burl-ct-v440-2    | burl-ct-v440-2 | sun4u    | SunOS | ompi-nightly-v1.3 | 1.3.1a0r20508  | ibm-64   | cxx_create_disp            | 8  | btl_sm_add_procs | 0    | 1    | 0    | 0     | 0
  2 | 2009-02-27 23:37:02 | sun   | burl-ct-v20z-2    | burl-ct-v20z-2 | i86pc    | SunOS | ompi-nightly-v1.3 | 1.3.1rc1r20628 | ibm-64   | lbub                       | 4  | btl_sm_add_procs | 0    | 1    | 0    | 0     | 0
  3 | 2009-03-05 00:15:39 | sun   | burl-ct-v20z-2    | burl-ct-v20z-2 | i86pc    | SunOS | ompi-nightly-v1.3 | 1.3.1rc3r20684 | ibm-32   | loop                       | 4  | btl_sm_add_procs | 0    | 1    | 0    | 0     | 0
  4 | 2009-03-05 22:31:43 | sun   | burl-ct-v20z-2    | burl-ct-v20z-2 | i86pc    | SunOS | ompi-nightly-v1.3 | 1.3.1rc4r20704 | intel-64 | MPI_Type_size_MPI_LB_UB_c  | 4  | btl_sm_add_procs | 0    | 1    | 0    | 0     | 0
  5 | 2009-03-10 14:47:36 | cisco | svbu-mpi[035-036] | svbu-mpi       | x86_64   | Linux | ompi-nightly-v1.3 | 1.3.1rc5r20730 | intel    | MPI_Test_cancelled_false_c | 8  | btl_sm_add_procs | 0    | 1    | 0    | 0     | 0

What do these tests have in common?

  ./intel_tests/src/MPI_Test_cancelled_false_c.c
  ./intel_tests/src/MPI_Type_size_MPI_LB_UB_c.c
  ./ibm/onesided/cxx_create_disp.cc
  ./ibm/datatype/lbub2.c
  ./ibm/datatype/loop.c
  ./ibm/datatype/lbub.c

It almost looks like the problem is more likely to occur when MPI_UB or
MPI_LB is involved, or am I just imagining things?

-Ethan

>
> Looking through the sm startup code, I can't see exactly what the problem 
> would be.  :-(
>
>
> On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:
>
>> I'll run some tests with 1.3.1 on one of our systems and see if it
>> shows up there. If it is truly rare and was in 1.3.0, then I
>> personally don't have a problem with it. Got bigger problems with
>> hanging collectives, frankly - and we don't know how the sm changes
>> will affect this problem, if at all.
>>
>>
>> On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:
>>
>> > Jeff Squyres wrote:
>> >> So -- Brad/George -- this technically isn't a regression against
>> >> v1.3.0 (do we know if this can happen in 1.2?  I don't recall
>> >> seeing it there, but if it's so elusive...  I haven't been MTT
>> >> testing the 1.2 series in a long time).  But it is a nonzero problem.
>> >>
>> > I have not seen 1.2 fail with this problem but I honestly don't know
>> > if that is a fluke or not.
>> >
>> > --td
>> >
>> >> Should we release 1.3.1 without a fix?
>> >>
>> >
>> >>
>> >> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>> >>
>> >>> I actually wasn't implying that Eugene's changes -caused- the
>> >>> problem,
>> >>> but rather that I thought they might have -fixed- the problem.
>> >>>
>> >>> :-)
>> >>>
>> >>>
>> >>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>> >>>
>> >>> > I forgot to mention that since I ran into this issue so long ago, I
>> >>> > really doubt that Eugene's SM changes have caused it.
>> >>> >
>> >>> > --td
>> >>> >
>> >>> > Terry Dontje wrote:
>> >>> >> Hey!!!  I ran into this problem many months ago, but it's been so
>> >>> >> elusive that I haven't nailed it down.  The first time we saw this
>> >>> >> was last October.  I did some MTT gleaning and could not find
>> >>> >> anyone but Solaris having this issue under MTT.  What's interesting
>> >>> >> is that I gleaned Sun's MTT results and could not find any of these
>> >>> >> failures as far back as last October.
>> >>> >>
>> >>> >> What it looked like to me was that the shared memory segment might
>> >>> >> not have been initialized with 0's, thus allowing one of the
>> >>> >> processes to start accessing addresses that did not hold a valid
>> >>> >> address.  However, when I was looking at this I was told the mmap
>> >>> >> file was created with ftruncate, which essentially should 0-fill
>> >>> >> the memory.  So I was at a loss as to why this was happening.
>> >>> >>
>> >>> >> I was able to reproduce this for a little while by manually
>> >>> >> setting up a script that ran a small np=2 program over and over
>> >>> >> for somewhere under 3-4 days.  But around November I was unable
>> >>> >> to reproduce the issue after 4 days of runs and threw up my hands
>> >>> >> until I could find more failures under MTT, which for Sun I
>> >>> >> haven't.
>> >>> >>
>> >>> >> Note that I was able to reproduce this issue with both SPARC and
>> >>> >> Intel based platforms.
>> >>> >>
>> >>> >> --td
>> >>> >>
>> >>> >> Ralph Castain wrote:
>> >>> >>> Hey Jeff
>> >>> >>>
>> >>> >>> I seem to recall seeing the identical problem reported on the
>> >>> >>> user list not long ago... or it may have been the devel list.
>> >>> >>> Anyway, it was during btl_sm_add_procs, and the code was segv'ing.
>> >>> >>>
>> >>> >>> I don't have the archives handy here, but perhaps you might
>> >>> >>> search them and see if there is a common theme. IIRC, some of
>> >>> >>> Eugene's fixes impacted this problem.
>> >>> >>>
>> >>> >>> Ralph
>> >>> >>>
>> >>> >>>
>> >>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>> >>> >>>
>> >>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>> >>> >>>>
>> >>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>> >>> >>>>> MTT.  :-(  I can't reproduce them manually, but they seem to
>> >>> >>>>> only happen in a very small fraction of overall MTT runs.  I'm
>> >>> >>>>> seeing at least 3 classes of errors:
>> >>> >>>>>
>> >>> >>>>> 1. btl_sm_add_procs.c:529, which is this:
>> >>> >>>>>
>> >>> >>>>>       if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>> >>> >>>>>
>> >>> >>>>> j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
>> >>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>> >>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>> >>> >>>>> .fifo[3][3] = x+3*offset).  But gdb says:
>> >>> >>>>>
>> >>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>> >>> >>>>> Cannot access memory at address 0x2a96b73050
>> >>> >>>>>
>> >>> >>>>
>> >>> >>>>
>> >>> >>>> Bah -- this is a red herring; this memory is in the shared
>> >>> >>>> memory segment, and that memory is not saved in the corefile.
>> >>> >>>> So of course gdb can't access it (I just did a short controlled
>> >>> >>>> test and proved this to myself).
>> >>> >>>>
>> >>> >>>> But I don't understand why I would have a bunch of tests that
>> >>> >>>> all segv at btl_sm_add_procs.c:529.  :-(
>> >>> >>>>
>> >>> >>>> --
>> >>> >>>> Jeff Squyres
>> >>> >>>> Cisco Systems
>> >>> >>>>
>> >>> >>>> _______________________________________________
>> >>> >>>> devel mailing list
>> >>> >>>> de...@open-mpi.org
>> >>> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> >>> >>>
>> >>> >>
>> >>> >>
>> >>> >
>> >>>
>> >>>
>> >>
>> >>
>> >
>>
>
>
> -- 
> Jeff Squyres
> Cisco Systems
>
