On Thu, Feb 11, 2016 at 01:39:09PM -0500, Bob Peterson wrote:
> The problem is: While testing the dlm in multiple recovery situations,
> Nate and I discovered multiple problems. Until recently, no one has tried
> to run recovery tests on an upstream DLM,

(Let's distinguish tcp connection testing/recovery vs locking
testing/recovery. I agree we've never looked at the tcp connections too
much since the node is typically dead anyway.)

> I agree that some of these patches might be unnecessary improvements.
> I'll try to pare them down to what is absolutely necessary and what
> is not. I'll also document exactly why the necessary ones are needed.

Improvements are fine, I was just confused about which were fixes vs
cleanups.

> I'll also try to post them in order of highest priority and repost
> them as individual patches rather than a set.
>
> The recovery tests are somewhat slow, so this will take some time.
>
> BTW, Have you had a chance to look at the patch I posted on 18 January,
> titled "DLM: Replace nodeid_to_addr with kernel_getpeername"?
> That definitely fixes one bug in patch b3a5bbfd which you mentioned.

Great, thanks, that's the key one that I'd missed or forgotten.

> I assume you're not suggesting I combine that patch with other patches
> to stabilize b3a5bbfd, right? As you well know, this is very touchy
> code and it's easier to diagnose and debug a larger number of smaller
> patches.

No, I don't have any concerns with the other improvements/fixes you have
since the main issue was fixed in that nodeid_to_addr replacement.