On Fri, Dec 04, 2015 at 09:51:53AM -0500, Bob Peterson wrote:
> it's from the fenced process, and if so, queue the final put. That should
> mitigate the problem.
Bob, I'm perplexed by the focus on fencing; this issue is broader than fencing, as I mentioned in bz 1255872. Over the years that I've reported these issues, rarely if ever have they involved fencing. Any userland process, not just the fencing process, can allocate memory, fall into the generic shrinking path, get into gfs2 and dlm, and end up blocked for some undefined time. That can cause problems in any number of ways.

The specific problem you're focused on may be one of the easier ways of demonstrating it: the original userland process is one of the cluster-related processes that gfs2/dlm depend on, combined with recovery, at a time when those processes are doing an especially large amount of the work that gfs2/dlm require. But problems could occur if any process is forced to unwittingly do this dlm work, not just a cluster-related process, and it would not need to involve recovery (or fencing, which is one small part of it).

I believe in gfs1 and the original gfs2, gfs had its own mechanism/threads for shrinking its cache and doing the dlm work, and would not do anything from the generic shrinking paths, because of this issue.

I don't think it's reasonable to expect random, unsuspecting processes on the system to perform gfs2/dlm operations that are often remote, lengthy, indefinite, or unpredictable. I think gfs2 needs to do that kind of heavy lifting from its own dedicated contexts, or from processes that have explicitly chosen to use gfs2.