I think this is relatively contained and has not been seen outside of MTT under normal operating conditions. Also, as Jeff has argued, it doesn't appear to be a regression against 1.3. George and I talked about this, and we agree that we should go ahead and release 1.3.1 as it currently stands. (Two technical side notes follow below the quoted thread.)

--brad
On Wed, Mar 11, 2009 at 7:58 AM, Jeff Squyres <jsquy...@cisco.com> wrote:
> So -- Brad/George -- this technically isn't a regression against v1.3.0
> (do we know if this can happen in 1.2?  I don't recall seeing it there,
> but if it's so elusive... I haven't been MTT testing the 1.2 series in a
> long time).  But it is a nonzero problem.
>
> Should we release 1.3.1 without a fix?
>
>
> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>
>> I actually wasn't implying that Eugene's changes -caused- the problem,
>> but rather that I thought they might have -fixed- the problem.
>>
>> :-)
>>
>>
>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>
>>> I forgot to mention that, since I ran into this issue so long ago, I
>>> really doubt that Eugene's SM changes caused it.
>>>
>>> --td
>>>
>>> Terry Dontje wrote:
>>>> Hey!!!  I ran into this problem many months ago, but it's been so
>>>> elusive that I haven't nailed it down.  The first time we saw this
>>>> was last October.  I did some MTT gleaning and could not find anyone
>>>> but Solaris having this issue under MTT.  What's interesting is that
>>>> I gleaned Sun's MTT results and could not find any of these failures
>>>> as far back as last October.
>>>>
>>>> What it looked like to me was that the shared memory segment might
>>>> not have been initialized with 0's, thus allowing one of the
>>>> processes to start following pointers that did not contain valid
>>>> addresses.  However, when I was looking at this I was told the mmap
>>>> file was created with ftruncate, which should essentially zero-fill
>>>> the memory.  So I was at a loss as to why this was happening.
>>>>
>>>> I was able to reproduce this for a little while by manually setting
>>>> up a script that ran a small np=2 program over and over for somewhere
>>>> under 3-4 days.  But around November I was unable to reproduce the
>>>> issue after 4 days of runs, and threw up my hands until I could find
>>>> more failures under MTT -- which, for Sun, I haven't.
>>>>
>>>> Note that I was able to reproduce this issue on both SPARC- and
>>>> Intel-based platforms.
>>>>
>>>> --td
>>>>
>>>> Ralph Castain wrote:
>>>>> Hey Jeff
>>>>>
>>>>> I seem to recall seeing the identical problem reported on the user
>>>>> list not long ago... or it may have been the devel list.  Anyway, it
>>>>> was during btl_sm_add_procs, and the code was segv'ing.
>>>>>
>>>>> I don't have the archives handy here, but perhaps you might search
>>>>> them and see if there is a common theme here.  IIRC, some of
>>>>> Eugene's fixes impacted this problem.
>>>>>
>>>>> Ralph
>>>>>
>>>>>
>>>>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>>>>
>>>>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>>>>>
>>>>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>>>>>> MTT. :-(  I can't reproduce them manually, but they seem to only
>>>>>>> happen in a very small fraction of overall MTT runs.  I'm seeing
>>>>>>> at least 3 classes of errors:
>>>>>>>
>>>>>>> 1. btl_sm_add_procs.c:529, which is this:
>>>>>>>
>>>>>>>   if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>>>>>>
>>>>>>> j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>>>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>>>>>> .fifo[3][3] = x+3*offset).
>>>>>>>
>>>>>>> But gdb says:
>>>>>>>
>>>>>>>   (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>>>>>>   Cannot access memory at address 0x2a96b73050
>>>>>>
>>>>>> Bah -- this is a red herring; this memory is in the shared memory
>>>>>> segment, and that memory is not saved in the corefile.  So of
>>>>>> course gdb can't access it (I just did a short controlled test and
>>>>>> proved this to myself).
>>>>>>
>>>>>> But I don't understand why I would have a bunch of tests that all
>>>>>> segv at btl_sm_add_procs.c:529. :-(
>>>>>>
>>>>>> --
>>>>>> Jeff Squyres
>>>>>> Cisco Systems
>
> --
> Jeff Squyres
> Cisco Systems
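
A side note on Terry's zero-fill point above: POSIX does specify that the
bytes added when ftruncate() extends a file read back as zero, so an
mmap'ed segment backed by such a file should come up zero-filled.  Here's
a minimal standalone sketch of that assumption -- not the actual sm BTL
setup code, and the file name is made up for illustration:

    /* Check that ftruncate()-extended, mmap()'ed memory reads as zero. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    int main(void)
    {
        const size_t len = 4096;
        int fd = open("/tmp/sm_zero_test", O_CREAT | O_RDWR, 0600);
        if (fd < 0) { perror("open"); return 1; }

        /* Extend the file; POSIX says the new bytes read back as zero. */
        if (ftruncate(fd, len) < 0) { perror("ftruncate"); return 1; }

        char *seg = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED,
                         fd, 0);
        if (seg == MAP_FAILED) { perror("mmap"); return 1; }

        /* Verify the zero-fill assumption byte by byte. */
        for (size_t i = 0; i < len; i++) {
            if (seg[i] != 0) { printf("non-zero byte at %zu\n", i); return 1; }
        }
        printf("all %zu bytes are zero, as expected\n", len);

        munmap(seg, len);
        close(fd);
        unlink("/tmp/sm_zero_test");
        return 0;
    }

If that invariant holds on the affected platforms, the explanation
presumably lies elsewhere than missing zero-fill.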
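
And on Jeff's corefile observation: on Linux, file-backed MAP_SHARED
mappings are typically not written into core files by default (this is
what /proc/<pid>/coredump_filter controls), so gdb reports "Cannot access
memory" for addresses that were perfectly valid when the process died.
A minimal sketch of the kind of controlled test Jeff describes -- my
guess at it, not his actual test, with a made-up file name:

    /* Map a shared, file-backed segment, store into it, then abort().
     * With core dumps enabled (ulimit -c unlimited), "gdb ./a.out core"
     * followed by "print *seg" will usually show "Cannot access memory"
     * because the shared mapping is not saved in the core file. */
    #include <stdlib.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/mman.h>

    volatile int *seg;   /* global, so gdb can find it by name */

    int main(void)
    {
        int fd = open("/tmp/sm_core_test", O_CREAT | O_RDWR, 0600);
        if (fd < 0 || ftruncate(fd, 4096) < 0) return 1;

        seg = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (seg == MAP_FAILED) return 1;

        *seg = 42;   /* the value lives only in the shared mapping */
        abort();     /* dump core so the mapping can be inspected */
    }

So "Cannot access memory" on a shared-memory address in a corefile tells
us nothing about whether the address was valid -- consistent with Jeff's
conclusion that the gdb output is a red herring.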