Re: [OMPI devel] 1.3.1 -- bad MTT from Cisco

Ralph Castain Wed, 11 Mar 2009 12:22:29 -0400

If it is that hard to replicate outside of MTT, then by all means let's just release it - users will probably never see it.


On Mar 11, 2009, at 10:07 AM, Terry Dontje wrote:

Ralph Castain wrote:
You know, this isn't the first time we have encountered errors that -only- appear when running under MTT. As per my other note, we are not seeing these failures here, even though almost all our users run under batch/scripts.
This has been the case with at least some of these other MTT-only errors as well. It can't help but make one wonder if there isn't something about MTT that is causing these failures to occur. It just seems too bizarre that a true code problem would -only- show itself when executing under MTT. You would think that it would have to appear in a script and/or batch environment as well.
Just something to consider.
Ok, I actually have reproduced this error outside of MTT. But it took a script running the same program for over a couple days. So in this particular instance I don't believe MTT is adding any badness other than possibly adding a load to the system.
--td
On Mar 11, 2009, at 9:38 AM, Jeff Squyres wrote:
As Terry stated, I think this bugger is quite rare. I'm having a helluva time trying to reproduce it manually (over 5k runs this morning and still no segv). Ugh.
Looking through the sm startup code, I can't see exactly what the problem would be. :-(
On Mar 11, 2009, at 11:34 AM, Ralph Castain wrote:
I'll run some tests with 1.3.1 on one of our systems and see if it
shows up there. If it is truly rare and was in 1.3.0, then I
personally don't have a problem with it. Got bigger problems with
hanging collectives, frankly - and we don't know how the sm changes
will affect this problem, if at all.


On Mar 11, 2009, at 7:50 AM, Terry Dontje wrote:

> Jeff Squyres wrote:
>> So -- Brad/George -- this technically isn't a regression against
>> v1.3.0 (do we know if this can happen in 1.2?  I don't recall
>> seeing it there, but if it's so elusive...  I haven't been MTT
>> testing the 1.2 series in a long time). But it is a nonzero problem.
>>
> I have not seen 1.2 fail with this problem but I honestly don't know
> if that is a fluke or not.
>
> --td
>
>> Should we release 1.3.1 without a fix?
>>
>
>>
>> On Mar 11, 2009, at 7:30 AM, Ralph Castain wrote:
>>
>>> I actually wasn't implying that Eugene's changes -caused- the problem,
>>> but rather that I thought they might have -fixed- the problem.
>>>
>>> :-)
>>>
>>>
>>> On Mar 11, 2009, at 4:34 AM, Terry Dontje wrote:
>>>
>>> > I forgot to mention that since I ran into this issue so long ago I
>>> > really doubt that Eugene's SM changes have caused this issue.
>>> >
>>> > --td
>>> >
>>> > Terry Dontje wrote:
>>> >> Hey!!! I ran into this problem many months ago but it's been so
>>> >> elusive that I haven't nailed it down. First time we saw this
>>> >> was last October. I did some MTT gleaning and could not find
>>> >> anyone but Solaris having this issue under MTT.  What's interesting
>>> >> is I gleaned Sun's MTT results and could not find any of these
>>> >> failures as far back as last October.
>>> >> What it looked like to me was that the shared memory segment might
>>> >> not have been initialized with 0's thus allowing one of the
>>> >> processes to start accessing addresses that did not have an
>>> >> appropriate address. However, when I was looking at this I was
>>> >> told the mmap file was created with ftruncate which essentially
>>> >> should 0 fill the memory. So I was at a loss as to why this was
>>> >> happening.
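
The ftruncate point above can be checked in isolation. The following is only a sketch, assuming a POSIX system; it is not the Open MPI sm setup code, and the scratch file name and the function name fresh_mapping_is_zero are made up for illustration. It grows an empty file with ftruncate(), maps it MAP_SHARED, and verifies that every byte reads back as zero, which is what POSIX specifies for the extended area of a file.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Create an empty scratch file, extend it to len bytes with ftruncate(),
 * map it shared, and scan it.
 * Returns 1 if every byte is zero, 0 if a nonzero byte is found,
 * and -1 if any setup step fails. */
static int fresh_mapping_is_zero(size_t len)
{
    assert(len > 0);                              /* mmap() rejects a zero length */

    char path[] = "/tmp/sm_zero_check_XXXXXX";    /* hypothetical scratch file */
    int fd = mkstemp(path);
    if (fd < 0) {
        return -1;
    }
    unlink(path);                                 /* file disappears once closed */

    /* Growing the file: POSIX says the new area appears zero-filled. */
    if (ftruncate(fd, (off_t)len) != 0) {
        close(fd);
        return -1;
    }

    unsigned char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                            MAP_SHARED, fd, 0);
    close(fd);                                    /* the mapping stays valid */
    if (p == MAP_FAILED) {
        return -1;
    }

    int all_zero = 1;
    for (size_t i = 0; i < len; i++) {
        if (p[i] != 0) {
            all_zero = 0;
            break;
        }
    }

    munmap(p, len);
    return all_zero;
}
```

Calling fresh_mapping_is_zero(4096) from a small main() is enough to try it. A result of 1 only says that a freshly created mapping starts out zeroed on that system; it says nothing about what other processes may have written before a late-arriving process looks at the segment.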
>>> >>
>>> >> I was able to reproduce this for a little while manually setting up
>>> >> a script that ran a small np=2 program over and over for some time
>>> >> under 3-4 days. But around November I was unable to reproduce the
>>> >> issue after 4 days of runs and threw up my hands until I was able
>>> >> to find more failures under MTT which for Sun I haven't.
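
A run-it-until-it-breaks loop like the one described above can be sketched as follows. This is an illustration under assumptions, not the script that was actually used: the function name run_until_failure is made up, and the command to run is left to the caller. It forks and execs the command repeatedly and stops at the first run that is killed by a signal (for example a segv) or exits nonzero.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Run the command in argv[] up to max_runs times.
 * Returns the 0-based index of the first run that was killed by a signal
 * or exited with a nonzero status, -1 if every run succeeded, and -2 if
 * fork() or waitpid() itself failed. */
static long run_until_failure(char *const argv[], long max_runs)
{
    assert(argv != NULL && argv[0] != NULL);

    for (long i = 0; i < max_runs; i++) {
        pid_t pid = fork();
        if (pid < 0) {
            return -2;
        }
        if (pid == 0) {
            execvp(argv[0], argv);
            _exit(127);                 /* exec failed; report as a bad run */
        }

        int status = 0;
        if (waitpid(pid, &status, 0) < 0) {
            return -2;
        }

        if (WIFSIGNALED(status)) {
            fprintf(stderr, "run %ld: killed by signal %d\n",
                    i, WTERMSIG(status));
            return i;
        }
        if (!WIFEXITED(status) || WEXITSTATUS(status) != 0) {
            fprintf(stderr, "run %ld: abnormal or nonzero exit\n", i);
            return i;
        }
    }
    return -1;
}
```

For the case discussed here the argument vector would be something like {"mpirun", "-np", "2", "./test_prog", NULL}, where ./test_prog stands in for whatever small MPI program is being exercised. Given how rare the failure is, max_runs would need to be large and the loop left running for days.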
>>> >>
>>> >> Note that I was able to reproduce this issue with both SPARC and
>>> >> Intel based platforms.
>>> >>
>>> >> --td
>>> >>
>>> >> Ralph Castain wrote:
>>> >>> Hey Jeff
>>> >>>
>>> >>> I seem to recall seeing the identical problem reported on the user
>>> >>> list not long ago...or may have been the devel list. Anyway, it
>>> >>> was during btl_sm_add_procs, and the code was segv'ing.
>>> >>>
>>> >>> I don't have the archives handy here, but perhaps you might search
>>> >>> them and see if there is a common theme here. IIRC, some of
>>> >>> Eugene's fixes impacted this problem.
>>> >>>
>>> >>> Ralph
>>> >>>
>>> >>>
>>> >>> On Mar 10, 2009, at 7:49 PM, Jeff Squyres wrote:
>>> >>>
>>> >>>> On Mar 10, 2009, at 9:13 PM, Jeff Squyres (jsquyres) wrote:
>>> >>>>
>>> >>>>> Doh -- I'm seeing intermittent sm btl failures on Cisco 1.3.1
>>> >>>>> MTT.  :-(  I can't reproduce them manually, but they seem to only
>>> >>>>> happen in a very small fraction of overall MTT runs.  I'm seeing
>>> >>>>> at least 3 classes of errors:
>>> >>>>>
>>> >>>>> 1. btl_sm_add_procs.c:529 which is this:
>>> >>>>>
>>> >>>>>       if(mca_btl_sm_component.fifo[j][my_smp_rank].head_lock != NULL) {
>>> >>>>>
>>> >>>>> j = 3, my_smp_rank = 1.  mca_btl_sm_component.fifo[j][my_smp_rank]
>>> >>>>> appears to have a valid value in it (i.e., .fifo[3][0] = x,
>>> >>>>> .fifo[3][1] = x+offset, .fifo[3][2] = x+2*offset,
>>> >>>>> .fifo[3][3] = x+3*offset).
>>> >>>>> But gdb says:
>>> >>>>>
>>> >>>>> (gdb) print mca_btl_sm_component.fifo[j][my_smp_rank]
>>> >>>>> Cannot access memory at address 0x2a96b73050
>>> >>>>>
>>> >>>>
>>> >>>>
>>> >>>> Bah -- this is a red herring; this memory is in the shared memory
>>> >>>> segment, and that memory is not saved in the corefile. So of
>>> >>>> course gdb can't access it (I just did a short controlled test
>>> >>>> and proved this to myself).
>>> >>>>
>>> >>>> But I don't understand why I would have a bunch of tests that all
>>> >>>> segv at btl_sm_add_procs.c:529.  :-(
>>> >>>>
>>> >>>> --
>>> >>>> Jeff Squyres
>>> >>>> Cisco Systems
>>> >>>>
>>> >>>> _______________________________________________
>>> >>>> devel mailing list
>>> >>>> [email protected]
>>> >>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Jeff Squyres
Cisco Systems
