On Thu, 24 Apr 2008 12:07:03 -0700 Hal Rosenstock <[EMAIL PROTECTED]> wrote:
> On Thu, 2008-04-24 at 09:57 -0700, Ira Weiny wrote:
> > >
> > > One side comment on the non OpenSM aspect of this:
> > >
> > > Why is the node temporarily unavailable ? There is a "contract" that the
> > > node makes with the SM that it clearly isn't honoring. Is any
> > > investigation going on relative to this aspect of the issue ?
> > >
> >
> > Yes, we are working on finding the root cause. I agree that the "contract" is
> > not being honored. This is one of the reasons I was hesitant to implement any
> > fix to be submitted.
>
> I think the two issues can be tackled in parallel.
>
> > I don't think this is truly a bug in the stack.
>
> Any ideas on what it is ? If not, would you be willing to try something
> assuming the end node issue is easily reproducible ?

The root cause has something to do with a user's job causing this "soft lockup"
in the kernel. We believe these jobs sometimes run the node (diskless/no swap)
out of memory. Under the OOM condition I don't think the node can be trusted.
Unfortunately, this is another case where we can't seem to reproduce the issue
without the user's job. :-(

As I said in a previous email, I was excited that Or mentioned what may be
another way to simulate this condition on the IB side. I have set that up and
see some issues there. I will see what I can find.

> > However, I could see this causing issues for people[*] and it might be nice to
> > have a "fix".
>
> Sure; both are issues which should be understood better and fixed IMO.

Agreed. I have spoken with our other developer and he is still trying to get a
reproducer.

Ira
