r20948 still hangs; changing mpool_sm_min_size solves it.

Lenny.

On Tue, Apr 7, 2009 at 3:42 AM, Eugene Loh <eugene....@sun.com> wrote:

> George Bosilca wrote:
>
>> You're right, the sentence was messed up. My intent was to say that I
>> found the problem and made a fix, and once this fix was applied to the
>> trunk I was not able to reproduce the deadlock.
>>
>
> But you were able to reproduce the deadlock before you made the fix?
>
> Anyhow, if I get fresh bits (through r20947) and I back out r20944 (either
> in the source code or simply by setting the mpool_sm_min_size MCA parameter
> to 0), I get deadlock.
>
>> Based on your description of the bug, I forced osu_bw to send 1024
>> non-blocking sends (instead of the default 64), and I still don't get
>> the deadlock. I'm thrilled ...
>>
>
> Yes, that's a good test.  You're sure you had mpool_sm_min_size set to 0?
>  I just don't have the same luck you do.  I get the hang even with your
> fixes.
>
>
>> On Apr 6, 2009, at 19:56, Eugene Loh wrote:
>>
>>> George Bosilca wrote:
>>>
>>>> I got some free time (yee-haw) and took a look at the OB1 PML in order
>>>> to fix the issue. I think I found the problem, as I'm unable to reproduce
>>>> this error.
>>>>
>>>
>>> Sorry, this sentence has me baffled. Are you unable to reproduce the
>>> problem before the fixes or afterwards? The first step is to reproduce the
>>> problem, right? To do so:
>>>
>>> A) Back out r20944. The easy way to do that is just:
>>>
>>>  % setenv OMPI_MCA_mpool_sm_min_size 0
>>>
>>> B) Check that osu_bw.c hangs when using sm and you reach the rendezvous
>>>  message size.
>>>
>>> C) Introduce your changes and make sure that osu_bw.c runs to
>>>  completion.
>>>
>>>> Can you please give it a try with r20946 and r20947 but without r20944?
>>>>
>>>
>>> osu_bw.c hangs for me.  The PML fix did not seem to work.
>>>
>>
>> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
