Hi list,

Sorry for reposting this mail if you have received it, since it seems
to me that I haven't successfully sent it out, maybe the attachment is
too large. I'll send the attachment separately if anoyone is
interested in it :)

Hi David,

This patch was made about eight months ago, however, at that time I
have no way to regularly reproduce it and haven't verified if this
patch really works, so I set it aside for so long ...

These months, I also heard of many users reporting this issue but
unfortunately when I requested them to test the patch, most of them
even didn't try it at all :(

Luckily, things have changed now. One user met this issue two months
ago and he's also very kindly to test the patch. The result is the
patch really works.

Attached is the log before they apply the patch. This time the log
has already included the debugging messages which were added by the
commit 27b09badd40a2d1500500fa6945aeb532f75bd13 , so we can see what
really happens on the user's site.
(The log is a bit large when it was uncompressed, this is because the
spinning would print many messages to the log.)

I rebased the patch against current upstream code now. Thank you for
your review in advance :)

Signed-off-by: Jiaju Zhang <jjzhang.li...@gmail.com>
---
 group/dlm_controld/cpg.c |    1 +
 1 files changed, 1 insertions(+), 0 deletions(-)

diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
index 9b0d223..12cb202 100644
--- a/group/dlm_controld/cpg.c
+++ b/group/dlm_controld/cpg.c
@@ -651,6 +651,7 @@ static int check_fs_done(struct lockspace *ls)
                if (node->fs_notified) {
                        log_group(ls, "check_fs nodeid %d clear", node->nodeid);
                        node->check_fs = 0;
+                       node->fs_notified = 0;
                } else {
                        log_group(ls, "check_fs nodeid %d needs fs notify",
                                  node->nodeid);


On Mon, Mar 01, 2010 at 06:06:32AM +0800, Jiaju Zhang wrote:
> Hi,
> 
> About the issue that dlm_controld and fs_controld sit spinning,
> retrying and replying for the fs_notified check, I have a suspision
> that another scenario may also hit that logic:
> 
> If the node->fs_notified has been set to 1 by previous change, when a
> new change comes and needs to check the node->fs_notified, because it
> has not been reset to 0, so check_fs_done will succeed even if
> dlm_controld has not received the notification from fs_controld this
> time.
> For example, given that the following membership changes n, n+1, n+2,
> we see what happens on node X:
> Step 1: cg n: node Y leaves with CPG_REASON_NODEDOWN reason,
>         eventually in node X's ls->node_history, node Y's fs_notified
>         = 1
> Step 2: cg n+1: node Y joins ...
> Step 3: cg n+2: node Y leaves with CPG_REASON_NODEDOWN reason, one
>         possible scenario is: before fs_controld's notification
>         arrives, dlm_controld has known node Y is down from CPG
>         message and done a lot of work, and it saw node Y's
>         fs_notified = 1 (been set in Step 1) then passed the fs check
>         wrongly. So node Y's check_fs reset to 0.
> Step 4: fs_controld's notification arrives, it sees node Y's check_fs
>         = 0 and assumes dlm_controld has not known node Y is down and
>         retries to send the notification. But in fact, dlm_controld
>         has already known this and finished all the work, which will
>         result in the spinning ... 
> 
> I'm not sure if I read the code correctly :-) Below is the patch which
> reset the node->fs_notified. Review and comments are highly
> appreciated!
> 
> Thanks,
> Jiaju
> 
> Signed-off-by: Jiaju Zhang <jjzhang.li...@gmail.com>
> ---
>  group/dlm_controld/cpg.c |    1 +
>  1 files changed, 1 insertions(+), 0 deletions(-)
> 
> diff --git a/group/dlm_controld/cpg.c b/group/dlm_controld/cpg.c
> index d5245ce..b257595 100644
> --- a/group/dlm_controld/cpg.c
> +++ b/group/dlm_controld/cpg.c
> @@ -636,6 +636,7 @@ static int check_fs_done(struct lockspace *ls)
>  
>               if (node->fs_notified) {
>                       node->check_fs = 0;
> +                     node->fs_notified = 0;
>               } else {
>                       log_group(ls, "check_fs nodeid %d needs fs notify",
>                                 node->nodeid);

Reply via email to