[Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done

2010-11-08 Thread Jiaju Zhang
Hi list,

Sorry for reposting this mail if you have received it, since it seems
to me that I haven't successfully sent it out, maybe the attachment is
too large. I'll send the attachment separately if anoyone is
interested in it :)

Hi David,

This patch was made about eight months ago, however, at that time I
have no way to regularly reproduce it and haven't verified if this
patch really works, so I set it aside for so long ...

These months, I also heard of many users reporting this issue but
unfortunately when I requested them to test the patch, most of them
even didn't try it at all :(

Luckily, things have changed now. One user met this issue two months
ago and he's also very kindly to test the patch. The result is the
patch really works.

Attached is the log before they apply the patch. This time the log
has already included the debugging messages which were added by the
commit 27b09badd40a2d1500500fa6945aeb532f75bd13 , so we can see what
really happens on the user's site.
(The log is a bit large when it was uncompressed, this is because the
spinning would print many messages to the log.)

I rebased the patch against current upstream code now. Thank you for
your review in advance :)

Signed-off-by: Jiaju Zhang jjzhang.li...@gmail.com
---
 group/dlm_controld/cpg.c |1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
index 9b0d223..12cb202 100644
--- a/group/dlm_controld/cpg.c
+++ b/group/dlm_controld/cpg.c
@@ -651,6 +651,7 @@ static int check_fs_done(struct lockspace *ls)
if (node-fs_notified) {
log_group(ls, check_fs nodeid %d clear, node-nodeid);
node-check_fs = 0;
+   node-fs_notified = 0;
} else {
log_group(ls, check_fs nodeid %d needs fs notify,
  node-nodeid);


On Mon, Mar 01, 2010 at 06:06:32AM +0800, Jiaju Zhang wrote:
 Hi,
 
 About the issue that dlm_controld and fs_controld sit spinning,
 retrying and replying for the fs_notified check, I have a suspision
 that another scenario may also hit that logic:
 
 If the node-fs_notified has been set to 1 by previous change, when a
 new change comes and needs to check the node-fs_notified, because it
 has not been reset to 0, so check_fs_done will succeed even if
 dlm_controld has not received the notification from fs_controld this
 time.
 For example, given that the following membership changes n, n+1, n+2,
 we see what happens on node X:
 Step 1: cg n: node Y leaves with CPG_REASON_NODEDOWN reason,
 eventually in node X's ls-node_history, node Y's fs_notified
 = 1
 Step 2: cg n+1: node Y joins ...
 Step 3: cg n+2: node Y leaves with CPG_REASON_NODEDOWN reason, one
 possible scenario is: before fs_controld's notification
 arrives, dlm_controld has known node Y is down from CPG
 message and done a lot of work, and it saw node Y's
 fs_notified = 1 (been set in Step 1) then passed the fs check
 wrongly. So node Y's check_fs reset to 0.
 Step 4: fs_controld's notification arrives, it sees node Y's check_fs
 = 0 and assumes dlm_controld has not known node Y is down and
 retries to send the notification. But in fact, dlm_controld
 has already known this and finished all the work, which will
 result in the spinning ... 
 
 I'm not sure if I read the code correctly :-) Below is the patch which
 reset the node-fs_notified. Review and comments are highly
 appreciated!
 
 Thanks,
 Jiaju
 
 Signed-off-by: Jiaju Zhang jjzhang.li...@gmail.com
 ---
  group/dlm_controld/cpg.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)
 
 diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
 index d5245ce..b257595 100644
 --- a/group/dlm_controld/cpg.c
 +++ b/group/dlm_controld/cpg.c
 @@ -636,6 +636,7 @@ static int check_fs_done(struct lockspace *ls)
  
   if (node-fs_notified) {
   node-check_fs = 0;
 + node-fs_notified = 0;
   } else {
   log_group(ls, check_fs nodeid %d needs fs notify,
 node-nodeid);



Re: [Cluster-devel] [PATCH] dlm: Reset fs_notified when check_fs_done

2010-11-08 Thread David Teigland
On Mon, Nov 08, 2010 at 11:05:49PM +0800, Jiaju Zhang wrote:
 Luckily, things have changed now. One user met this issue two months
 ago and he's also very kindly to test the patch. The result is the
 patch really works.
 
 Attached is the log before they apply the patch. This time the log
 has already included the debugging messages which were added by the
 commit 27b09badd40a2d1500500fa6945aeb532f75bd13 , so we can see what
 really happens on the user's site.
 (The log is a bit large when it was uncompressed, this is because the
 spinning would print many messages to the log.)
 
 I rebased the patch against current upstream code now. Thank you for
 your review in advance :)

Thanks, the patch looks good, I'll push this out.

Dave