Re: [DRBD-user] DOPD problem and Heartbeat

[email protected] Wed, 18 Apr 2012 13:40:06 -0700

On 18.04.2012 21:49, Lars Ellenberg wrote:

On Wed, Apr 18, 2012 at 09:41:53PM +0200, aluno3 wrote:

On Wed, Apr 18, 2012 at 07:55:32PM +0200, [email protected] wrote:

Hello


We are testing the DOPD mechanism and reviewing the source of dopd.c
(http://hg.linux-ha.org/lha-2.1/file/1d5b54f0a2e0/contrib/drbd-outdate-peer/dopd.c).

Would you please use heartbeat 3
(and pacemaker, unless you use the haresource mode of heartbeat)

When using pacemaker, use the drbd crm-fence-peer.sh.
It covers all the cases dopd would cover, and in fact even a couple more
corner cases in multiple failure scenarios.

We would like to use heartbeat 3 with the newer CRM, but our front end is
not adapted yet...

Is it OK that the loop in check_drbd_peer first checks the status of
each node and, if the node is dead, returns FALSE right away, even when
that node is only a ping node? The next part of the code does check
whether the node is a 'normal' node, but by then it is too late.

Then I guess we have to fix that.

Maybe the fix should look like this:

--- ./heartbeat/contrib/drbd-outdate-peer/dopd.c  2008-08-18 14:32:19.000000000 +0200
+++ ./heartbeat-dopdfix/contrib/drbd-outdate-peer/dopd.c  2012-04-18 20:10:41.000000000 +0200
@@ -226,7 +226,7 @@ check_drbd_peer(const char *drbd_peer)
         }
         while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
                 const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
-               if (!strcmp(status, "dead")) {
+               if (!strcmp(status, "dead") && !strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
                         cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                                node, status);
                         return FALSE;

I'd say, it should rather look like (against heartbeat 3 source, so it
may or may not directly apply on your tree; probably best to just copy
over all of contrib/drbd-outdate-peer from 3):

diff --git a/contrib/drbd-outdate-peer/dopd.c b/contrib/drbd-outdate-peer/dopd.c
--- a/contrib/drbd-outdate-peer/dopd.c
+++ b/contrib/drbd-outdate-peer/dopd.c
@@ -226,19 +226,26 @@ check_drbd_peer(const char *drbd_peer)
        }
        while((node = dopd_cluster_conn->llc_ops->nextnode(dopd_cluster_conn)) != NULL) {
                const char *status = dopd_cluster_conn->llc_ops->node_status(dopd_cluster_conn, node);
+
+               /* Look for the peer */
+               if (strcasecmp(node, drbd_peer))
+                       continue;
+
+               if (strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))) {
+                       cl_log(LOG_WARNING, "Cluster node: %s: status: %s is not a normal node",
+                              node, status);
+                       break;
+               }
+
                if (!strcmp(status, "dead")) {
                        cl_log(LOG_WARNING, "Cluster node: %s: status: %s",
                               node, status);
-                       return FALSE;
+                       break;
                }

-               /* Look for the peer */
-               if (!strcmp("normal", dopd_cluster_conn->llc_ops->node_type(dopd_cluster_conn, node))
-                       && !strcasecmp(node, drbd_peer)) {
-                       cl_log(LOG_DEBUG, "node %s found\n", node);
-                       found = TRUE;
-                       break;
-               }
+               cl_log(LOG_DEBUG, "node %s found with status %s\n", node, status);
+               found = TRUE;
+               break;
        }
        if (dopd_cluster_conn->llc_ops->end_nodewalk(dopd_cluster_conn) != HA_OK) {
                cl_log(LOG_INFO, "Cannot end node walk");


Not even compile tested, but I think this is what it should look like.
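
For anyone who wants to check the intended behaviour of the patch above
without a heartbeat tree at hand, the lookup can be modelled in plain C.
This is only a sketch under assumptions: struct node_info,
peer_is_reachable, the node names and the "active" status string are
made up for illustration, and the array stands in for the
nextnode()/node_type()/node_status() walk. It is not the dopd code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strcasecmp */

/* Hypothetical stand-in for one entry reported by the node walk. */
struct node_info {
    const char *name;
    const char *type;    /* "normal" or "ping" */
    const char *status;  /* "dead", or anything else for a live node */
};

/*
 * Model of the patched loop: skip every entry that is not the peer,
 * then let only the peer's own type and status decide the result.
 */
static bool peer_is_reachable(const struct node_info *nodes, size_t n,
                              const char *peer)
{
    bool found = false;

    assert(nodes != NULL && peer != NULL);

    for (size_t i = 0; i < n; i++) {
        /* Look for the peer; other nodes no longer matter. */
        if (strcasecmp(nodes[i].name, peer) != 0)
            continue;

        /* The peer must be a normal cluster node, not a ping node. */
        if (strcmp(nodes[i].type, "normal") != 0)
            break;

        /* A dead peer cannot be asked to outdate itself. */
        if (strcmp(nodes[i].status, "dead") == 0)
            break;

        found = true;
        break;
    }
    return found;
}
```

With a dead ping node listed ahead of a live peer, this model still
reports the peer as reachable, which is the behaviour the patch is
aiming for.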

After a quick test, the fix looks like it is working. Thanks for the help.

Consider a setup with:
-a configured ping node,
-timeouts: ping-int 10, deadping 10, deadtime 30

When both the replication link and the ping node go down, dopd starts
working. check_drbd_peer checks whether a node's status is dead (the
ping node is dead, the remote 'normal' node is fine) and, if so, returns
FALSE without marking the remote volumes as outdated over the other,
auxiliary path. Unfortunately, we hit exactly this problem during testing.
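
To make the failure described above concrete, here is a minimal model of
the unpatched loop. It is a sketch under assumptions, not the dopd
source: struct walk_entry, unpatched_peer_found, the node names and the
"active" status string are hypothetical, and the array replaces the real
node walk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>    /* strcasecmp */

/* Hypothetical stand-in for one entry reported by the node walk. */
struct walk_entry {
    const char *name;
    const char *type;    /* "normal" or "ping" */
    const char *status;  /* "dead", or anything else for a live node */
};

/*
 * Model of the unpatched loop: the status test comes first, so ANY dead
 * entry (including a ping node) ends the search before the peer is seen.
 */
static bool unpatched_peer_found(const struct walk_entry *nodes, size_t n,
                                 const char *peer)
{
    assert(nodes != NULL && peer != NULL);

    for (size_t i = 0; i < n; i++) {
        if (strcmp(nodes[i].status, "dead") == 0)
            return false;   /* gives up, whatever kind of node this is */

        if (strcmp(nodes[i].type, "normal") == 0 &&
            strcasecmp(nodes[i].name, peer) == 0)
            return true;    /* the original sets found = TRUE and breaks */
    }
    return false;
}
```

In this model the outcome depends on the walk order: if the dead ping
node is visited before the live peer, the function gives up, whereas
with the peer listed first it succeeds.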

We know that the DRBD timeouts have to be lower than the heartbeat
timeouts, but when dopd has to outdate a lot of remote resources, it
cannot do that in time. It is easy to hit a race.

--
: Lars Ellenberg
: LINBIT | Your Way to High Availability
: DRBD/HA support and consulting http://www.linbit.com

DRBD® and LINBIT® are registered trademarks of LINBIT, Austria.
__
please don't Cc me, but send to list   --   I'm subscribed
_______________________________________________
drbd-user mailing list
[email protected]
http://lists.linbit.com/mailman/listinfo/drbd-user
