Re: [DRBD-user] DOPD problem and Heartbeat

Lars Ellenberg Wed, 18 Apr 2012 12:49:29 -0700

On Wed, Apr 18, 2012 at 09:41:53PM +0200, aluno3 wrote:
> > On Wed, Apr 18, 2012 at 07:55:32PM +0200, [email protected] wrote:
> > > Hello
> > > 
> > > We are testing the DOPD mechanism and reviewing the source of dopd.c
> > > (http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).
> > 
> > Would you please use heartbeat 3
> > (and pacemaker, unless you use the haresources mode of heartbeat)?
> > 
> > When using pacemaker, use the drbd crm-fence-peer.sh.
> > It covers all the cases dopd would cover, and in fact even a couple more
> > corner cases in multiple failure scenarios.
> 
> We would like to use heartbeat 3 with the newer crm, but our front end is
> not adapted yet...
> 
> > 
> > > 
> > > Is it OK that the loop in check_drbd_peer checks the node status
> > > first and, if a node is dead, returns FALSE even when that node is a
> > > ping node? The next part of the code checks whether the node is a
> > > 'normal' node, but by then it is too late.
> > 
> > Then I guess we have to fix that.
> 
> Maybe the fix should look like this:
> 
> --- ./heartbeat/contrib/drbd-outdate-peer/dopd.c  2008-08-18 14:32:19.000000000 +0200
> +++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c  2012-04-18 20:10:41.000000000 +0200
> @@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
>         }
>         while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
>                 const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
> -               if (!strcmp(status, "dead")) {
> +               if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
>                         cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
>                                node, status);
>                         return FALSE;


I'd say, it should rather look like (against heartbeat 3 source, so it
may or may not directly apply on your tree; probably best to just copy
over all of contrib/drbd-outdate-peer from 3):

diff --git a/contrib/drbd-outdate-peer/dopd.c b/contrib/drbd-outdate-peer/dopd.c
--- a/contrib/drbd-outdate-peer/dopd.c
+++ b/contrib/drbd-outdate-peer/dopd.c
@@ -226,19 +226,26 @@ check_drbd_peer(const char *drbd_peer)
        }
        while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
                const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
+
+               /* Look for the peer */
+               if (strcasecmp(node, drbd_peer))
+                       continue;
+
+               if (strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
+                       cl_log(LOG_WARNING, "Cluster node: %s: status: %s is not a normal node",
+                              node, status);
+                       break;
+               }
+
                if (!strcmp(status, "dead")) {
                        cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                               node, status);
-                       return FALSE;
+                       break;
                }
 
-               /* Look for the peer */
-               if (!strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))
-                       && !strcasecmp(node, drbd_peer)) {
-                       cl_log(LOG_DEBUG, "node %s found\n", node);
-                       found = TRUE;
-                       break;
-               }
+               cl_log(LOG_DEBUG, "node %s found with status %s\n", node, status);
+               found = TRUE;
+               break;
        }
        if (dopd_cluster_conn->llc_ops->end_nodewalk(dopd_cluster_conn) != HA_OK) {
                cl_log(LOG_INFO, "Cannot end node walk");


Not even compile tested, but I think this is what it should look like.
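
To make the intent easier to see without the heartbeat client API around it,
here is a rough standalone sketch of the lookup logic before and after the
change. Everything in it is made up for illustration: struct node_info,
find_peer_old() and find_peer_new() are hypothetical names, not heartbeat
code, and the array simply stands in for what nextnode(), node_type() and
node_status() would report during the node walk.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>	/* strcasecmp (POSIX) */

/* Hypothetical stand-in for one entry of the heartbeat node walk. */
struct node_info {
	const char *name;
	const char *type;	/* "normal" or "ping" */
	const char *status;	/* e.g. "active" or "dead" */
};

/* Old logic: any dead node seen before the peer aborts the search,
 * even if that dead node is only a ping node. */
bool find_peer_old(const struct node_info *nodes, size_t n,
		   const char *drbd_peer)
{
	for (size_t i = 0; i < n; i++) {
		if (!strcmp(nodes[i].status, "dead"))
			return false;
		if (!strcmp("normal", nodes[i].type)
		    && !strcasecmp(nodes[i].name, drbd_peer))
			return true;
	}
	return false;
}

/* Proposed logic: skip every node that is not the peer, then judge
 * the peer alone by its type and status. */
bool find_peer_new(const struct node_info *nodes, size_t n,
		   const char *drbd_peer)
{
	for (size_t i = 0; i < n; i++) {
		if (strcasecmp(nodes[i].name, drbd_peer))
			continue;	/* not the peer, ignore it */
		if (strcmp("normal", nodes[i].type))
			break;		/* peer name matches a non-normal node */
		if (!strcmp(nodes[i].status, "dead"))
			break;		/* peer itself is dead */
		return true;
	}
	return false;
}
```

With a node list where a dead ping node comes before an active peer,
find_peer_old() gives up while find_peer_new() still finds the peer, which
is the case reported in this thread.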

> > > Suppose you have:
> > > - a configured ping node,
> > > - timeouts: ping-int 10, deadping 10, deadtime 30
> > > 
> > > and both the replication link and the ping node go down. dopd starts
> > > working: check_drbd_peer sees a dead node (the ping node is dead, the
> > > remote 'normal' node is fine), returns FALSE, and never marks the
> > > remote volumes as outdated over the other, auxiliary path. We hit
> > > exactly this problem during testing.
> > > 
> > > We know that DRBD timeouts have to be lower than heartbeat timeouts,
> > > but when dopd has to mark a lot of remote resources, it cannot do
> > > that in time. It is easy to hit a race.
> > 
> > -- 
> > : Lars Ellenberg
> > : LINBIT | Your Way to High Availability
> > : DRBD/HA support and consulting http://www.linbit.com
> > 
> > DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
> > __
> > please don't Cc me, but send to list   --   I'm subscribed
> > _______________________________________________
> > drbd-user mailing list
> > [email protected]
> > http://lists.linbit.com/mailman/listinfo/drbd-user
> > 
> 

-- 
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
