OK, unstuck, and moving forward with a patch from the DRBD email list...
I've got drbd configured in a fairly reliable Master/Slave setup, and I
can fail it back and forth between nodes using cibadmin and XML that
changes the place constraint from node to node. (Not sure what this
means, but when the drbd processes first come up, the GUI indicates one
as Master but does not show the other as Slave, only that it is
running. When I change the place constraint, Master moves from one node
to the other, and the formerly Master node then indicates Slave. From
that point on, behavior is as expected.)

Now I've created a group containing only a single Filesystem resource,
colocated with the drbd master (based on the previously discussed
constraint rules of -infinity for a stopped or slave drbd node) and
ordered to come up after the drbd master. I'm using target_role to
control whether HA starts it or not (one XML file sets target_role to
stopped, the other to started).

First question: what is the best way to start and stop resources
without using the GUI? In other words, is my use of target_role a good
way to control resources? Second question: does it make more sense to
define target_role in the group's instance_attributes or in the
instance_attributes of the individual primitive resource?
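For concreteness, the kind of XML I mean is roughly this (a sketch; the
ids are illustrative, and one file carries value="stopped" while the
other carries value="started"):

```xml
<!-- sketch: illustrative ids; one file sets value="stopped",
     the other value="started" -->
<primitive id="fs_mirror">
  <instance_attributes id="fs_mirror_instance_attrs">
    <attributes>
      <nvpair id="fs_mirror_target_role" name="target_role" value="stopped"/>
    </attributes>
  </instance_attributes>
</primitive>
```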

Thanks,
Doug

On Fri, 2007-04-20 at 14:46 -0400, Doug Knight wrote:

> Well, whatever was stuck, I had to do an rmmod to remove the drbd
> module from the kernel, then modprobe it back in, and the "stuck"
> Secondary indication went away.
> 
> Doug
> 
> On Fri, 2007-04-20 at 14:30 -0400, Doug Knight wrote:
> 
> > I completely shut down heartbeat on both nodes, cleared out the backup
> > cib.xml files, recopied the cib.xml from the primary node to the
> > secondary, then brought everything back up. This cleared the "diff"
> > error. The drbd master/slave pair came up as expected, but when I tried
> > to stop them, they eventually went into an unmanaged state. Looking at
> > the logs and comparing against the stop function in the OCF script, I
> > saw a successful "drbdadm down", but the additional status check after
> > the down (via drbdadm state) indicated that the down was unsuccessful.
> > I then manually verified that the drbd processes were indeed down, and
> > executed the following:
> > 
> > [EMAIL PROTECTED] xml]# /sbin/drbdadm -c /etc/drbd.conf state pgsql
> > Secondary/Unknown
> > [EMAIL PROTECTED] xml]# cat /proc/drbd
> > version: 8.0.1 (api:86/proto:86)
> > SVN Revision: 2784 build by [EMAIL PROTECTED], 2007-04-09 11:30:31
> >  0: cs:Unconfigured
> > 
> > It's the same output on either node, and drbd is definitely down on
> > both nodes. So /proc/drbd correctly indicates drbd is down, but the
> > subsequent check using drbdadm state comes back indicating one side is
> > up in Secondary mode, which it's not. This is why the resource is now
> > in unmanaged mode. Any ideas why the two tools would differ?
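In case it helps to pin down the discrepancy, this is the kind of check
I'd rather the stop logic rely on, parsing /proc/drbd instead of
"drbdadm state" (a sketch; it assumes the pgsql resource maps to drbd
minor 0, and inlines sample /proc/drbd output so it can be run without
a live drbd module):

```shell
# Sketch: decide whether drbd is down from /proc/drbd rather than
# "drbdadm state". Assumes the pgsql resource is drbd minor 0; sample
# /proc/drbd output is inlined so the check can be exercised anywhere.
proc_drbd='version: 8.0.1 (api:86/proto:86)
 0: cs:Unconfigured'

is_down() {
    # The device is down when its connection state is Unconfigured.
    printf '%s\n' "$1" | grep -q '^ *0: cs:Unconfigured'
}

if is_down "$proc_drbd"; then
    echo "drbd minor 0 is down"      # prints "drbd minor 0 is down"
else
    echo "drbd minor 0 still configured"
fi
```

On a live node the same function would be fed "$(cat /proc/drbd)"
instead of the inlined sample.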
> > 
> > Doug
> > 
> > On Fri, 2007-04-20 at 11:35 -0400, Doug Knight wrote:
> > 
> > > In the interim I set the filesystem group to unmanaged to test failing
> > > the drbd master/slave processes back and forth, using the value part
> > > of the place constraint. On my first attempt to switch nodes, it
> > > basically took both drbd processes down, and they stayed down. When I
> > > checked the logs on the node to which I was switching the primary
> > > drbd, I found a message about a failed application of a diff. I
> > > switched the place constraint back to the original node and decided
> > > to shut down heartbeat on the node where I was seeing the diff error;
> > > now the shutdown is hung and the diff error below repeats every
> > > minute:
> > > 
> > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.587 ->
> > > 0.11.588 not applied to 0.11.593: current "num_updates" is greater than
> > > required
> > > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of
> > > <diff > FAILED: Application of an update diff failed
> > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff
> > > operation failed: Application of an update diff failed
> > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.588 ->
> > > 0.11.589 not applied to 0.11.593: current "num_updates" is greater than
> > > required
> > > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of
> > > <diff > FAILED: Application of an update diff failed
> > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff
> > > operation failed: Application of an update diff failed
> > > 
> > > 
> > > My boss and I are getting kind of frustrated trying to get this setup
> > > to work. Is there something obvious I'm missing? Has anyone gotten HA
> > > 2.0.8 (using v2 monitoring and the drbd OCF script) and drbd version
> > > 8.0.1 working in a two-node cluster? I'm concerned because of the
> > > comment made earlier by Bernhard.
> > > 
> > > Doug
> > > 
> > > On Fri, 2007-04-20 at 10:55 -0400, Doug Knight wrote:
> > > 
> > > > I changed the constraints to point to the master_slave ID, and voila,
> > > > even without the Filesystem resource running, the drbd resource
> > > > recognized the place constraint and the GUI now indicates master
> > > > running where I expected it to. One down, one to go. Now, just to be
> > > > sure, here's the modified group XML with the notify nvpair added:
> > > > 
> > > > <group ordered="true" collocated="true" id="grp_pgsql_mirror">
> > > >   <primitive class="ocf" type="Filesystem" provider="heartbeat" id="fs_mirror">
> > > >     <instance_attributes id="fs_mirror_instance_attrs">
> > > >       <attributes>
> > > >         <nvpair id="fs_mirror_device" name="device" value="/dev/drbd0"/>
> > > >         <nvpair id="fs_mirror_directory" name="directory" value="/mirror"/>
> > > >         <nvpair id="fs_mirror_fstype" name="fstype" value="ext3"/>
> > > >         <nvpair id="fs_notify" name="notify" value="true"/>
> > > >       </attributes>
> > > >     </instance_attributes>
> > > >   </primitive>
> > > >   <instance_attributes id="grp_pgsql_mirror_instance_attrs">
> > > >     <attributes/>
> > > >   </instance_attributes>
> > > > </group>
> > > > 
> > > > I wanted to confirm I put it in the right place, as there was an
> > > > instance_attributes tag both for the primitive resource within the
> > > > group and for the group itself. I put it in the resource tag, per
> > > > your statement below; is that correct?
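For comparison, my understanding of the group-level alternative, with
notify in the group's own instance_attributes rather than the
primitive's, would look like this (a sketch; the nvpair id is
illustrative and the primitive is elided):

```xml
<!-- sketch: group-level placement of notify; primitive elided -->
<group ordered="true" collocated="true" id="grp_pgsql_mirror">
  <instance_attributes id="grp_pgsql_mirror_instance_attrs">
    <attributes>
      <nvpair id="grp_notify" name="notify" value="true"/>
    </attributes>
  </instance_attributes>
</group>
```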
> > > > 
> > > > Doug
> > > > 
> > > > On Fri, 2007-04-20 at 16:06 +0200, Andrew Beekhof wrote:
> > > > 
> > > > > On 4/20/07, Knight, Doug <[EMAIL PROTECTED]> wrote:
> > > > > > OK, here's what happened. The drbd resources were both successfully
> > > > > > running in Secondary mode on both servers, and both partitions were
> > > > > > synched. My Filesystem resource was stopped, with the colocation,
> > > > > > order, and place constraints in place. When I started the Filesystem
> > > > > > resource, which is part of a group, it triggered the appropriate drbd
> > > > > > slave to promote to master and transition to Primary. However, the
> > > > > > Filesystem resource did not complete or mount the partition, which I
> > > > > > believe is because notify is not enabled on it. A manual cleanup
> > > > > > finally got it to start and mount, following all of the constraints I
> > > > > > had defined. Next, I tried putting the server that was drbd primary
> > > > > > into Standby state, which caused all kinds of problems (hung process,
> > > > > > hung GUI, heartbeat shutdown wouldn't complete, etc.). I finally had
> > > > > > to restart heartbeat on the server I was trying to send into Standby
> > > > > > (note that this node was also the DC at the time). So I'm back up to
> > > > > > where I have drbd in slave/slave, secondary/secondary mode, and the
> > > > > > filesystem stopped.
> > > > > >
> > > > > > I wanted to add notify="true" to either the filesystem resource
> > > > > > itself or to its group, but the DTD does not define notify for groups
> > > > > > (even though for some reason the GUI thinks you CAN define the notify
> > > > > > attribute). I plan on eventually adding an IPaddr and a pgsql
> > > > > > resource to this group. So I have two questions: 1) Where does it
> > > > > > make more sense to add notify, at the group level or for the
> > > > > > individual resource; and 2) Should the DTD define notify as an
> > > > > > attribute of groups?
> > > > > 
> > > > > add it as a resource attribute
> > > > > 
> > > > >      <group ...>
> > > > >         <instance_attributes id="...">
> > > > >           <attributes>
> > > > >             <nvpair id="..." name="notify" value="true"/>
> > > > > _______________________________________________
> > > > > Linux-HA mailing list
> > > > > [email protected]
> > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > > See also: http://linux-ha.org/ReportingProblems
> > > > > 