r20948 still hangs; changing mpool_sm_min_size solves it.

Lenny.

On Tue, Apr 7, 2009 at 3:42 AM, Eugene Loh <eugene....@sun.com> wrote:

> George Bosilca wrote:
>
>> You're right, the sentence was messed-up. My intent was to say that I
>> found the problem, made a fix and once this fix was applied to the trunk
>> I was not able to reproduce the deadlock.
>
> But you were able to reproduce the deadlock before you made the fix?
>
> Anyhow, if I get fresh bits (through r20947) and I back out r20944 (either
> in the source code or simply by setting the mpool_sm_min_size MCA parameter
> to 0), I get deadlock.
>
>> Based on your description of the bug I forced osu_bw to send 1024
>> non-blocking sends (instead of the default 64), and I still don't get the
>> deadlock. I'm thrilled ...
>
> Yes, that's a good test. You're sure you had mpool_sm_min_size set to 0?
> I just don't have the same luck you do. I get the hang even with your
> fixes.
>
>> On Apr 6, 2009, at 19:56, Eugene Loh wrote:
>>
>>> George Bosilca wrote:
>>>
>>>> I got some free time (yeh haw) and took a look at the OB1 PML in order
>>>> to fix the issue. I think I found the problem, as I'm unable to
>>>> reproduce this error.
>>>
>>> Sorry, this sentence has me baffled. Are you unable to reproduce the
>>> problem before the fixes or afterwards? The first step is to reproduce
>>> the problem, right? To do so:
>>>
>>> A) Back out r20944. Easy way to do that is just
>>>
>>>    % setenv OMPI_MCA_mpool_sm_min_size 0
>>>
>>> B) Check that osu_bw.c hangs when using sm and you reach rendezvous
>>>    message size.
>>>
>>> C) Introduce your changes and make sure that osu_bw.c runs to
>>>    completion.
>>>
>>>> Can you please give it a try with 20946 and 20947 but without 20944?
>>>
>>> osu_bw.c hangs for me. The PML fix did not seem to work.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
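
For anyone retracing steps A) to C) quoted above, here is a rough sketch of how the hang check could be scripted so that a deadlock shows up as a timeout instead of a stuck terminal. It is not from the thread: `run_or_hang` is a hypothetical helper, the 120-second limit is arbitrary, the `mpirun` line is only an assumed way of launching osu_bw over the sm BTL, and it relies on `timeout` from GNU coreutils being installed. The `export` line is the sh equivalent of the csh `setenv` shown in step A.

```shell
#!/bin/sh
# Sketch only: run a command under a time limit and report whether it finished.
# run_or_hang is a hypothetical helper, not part of Open MPI or the OSU benchmarks.
run_or_hang() {
    limit="$1"; shift
    # timeout(1) from GNU coreutils exits with status 124 when the limit expires.
    timeout "$limit" "$@"
    status=$?
    if [ "$status" -eq 124 ]; then
        echo "HANG (no completion within ${limit}s)"
    elif [ "$status" -eq 0 ]; then
        echo "COMPLETED"
    else
        echo "FAILED (exit status $status)"
    fi
    return "$status"
}

# Step A: sh equivalent of the csh setenv line, effectively backing out r20944.
export OMPI_MCA_mpool_sm_min_size=0

# Steps B and C (assumed invocation; uncomment on a machine where osu_bw is built):
# run_or_hang 120 mpirun -np 2 --mca btl sm,self ./osu_bw
```

Running the same line before and after applying a candidate fix gives a simple before/after comparison; a "HANG" result only means the run did not finish within the chosen limit, so the limit should be comfortably longer than a normal osu_bw run.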