Well, whatever was stuck, I had to do an rmmod to remove the drbd module
from the kernel, then modprobe it back in, and the "stuck" Secondary
indication went away.
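For completeness, here is a sketch of the mismatch test behind that decision (the resource name pgsql and device 0 are from this thread; the reload itself needs root and a quiesced device, so it is left commented):

```shell
#!/bin/sh
# Sketch: detect the "stuck" symptom where drbdadm still reports
# Secondary while /proc/drbd says the device is Unconfigured.
# Inputs are passed in so this can be exercised without a live drbd.
stuck_secondary() {
  adm_state="$1"   # e.g. output of: drbdadm -c /etc/drbd.conf state pgsql
  proc_cs="$2"     # e.g. the cs: value for device 0 in /proc/drbd
  case "$adm_state" in
    Secondary/*) [ "$proc_cs" = "Unconfigured" ] && return 0 ;;
  esac
  return 1
}

if stuck_secondary "Secondary/Unknown" "Unconfigured"; then
  echo "stuck: reload the module"
  # rmmod drbd && modprobe drbd   # needs root and no users of the device
fi
```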

Doug

On Fri, 2007-04-20 at 14:30 -0400, Doug Knight wrote:

> I completely shut down heartbeat on both nodes, cleared out the backup
> cib.xml files, recopied the cib.xml from the primary node to the
> secondary, then brought everything back up. This cleared the "diff"
> error. The drbd master/slave pair came up as expected, but when I tried
> to stop them, they eventually went into an unmanaged state. Comparing
> the logs against the stop function in the OCF script, I saw a
> successful "drbdadm down", but the status check that follows the down
> (which uses "drbdadm state") indicated the down was unsuccessful. I
> then manually verified that the drbd processes were indeed down, and
> executed the following:
> 
> [EMAIL PROTECTED] xml]# /sbin/drbdadm -c /etc/drbd.conf state pgsql
> Secondary/Unknown
> [EMAIL PROTECTED] xml]# cat /proc/drbd
> version: 8.0.1 (api:86/proto:86)
> SVN Revision: 2784 build by [EMAIL PROTECTED], 2007-04-09 11:30:31
>  0: cs:Unconfigured
> 
> It's the same output on either node, and drbd is definitely down on both
> nodes. So /proc/drbd correctly indicates drbd is down, but the
> subsequent check using "drbdadm state" comes back indicating one side is
> up in Secondary mode, which it's not. This is why the resource is now
> unmanaged. Any ideas why the two tools would differ?
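One way an agent could guard against this disagreement is to trust the kernel's own view in /proc/drbd rather than drbdadm alone. A minimal sketch (reads /proc/drbd-style text on stdin; device number 0 is assumed, as in the output above):

```shell
#!/bin/sh
# Sketch: report "down" only when the kernel itself says device 0 is
# Unconfigured. Feed it /proc/drbd on a real system.
proc_drbd_down() {
  if grep -q '^ *0: cs:Unconfigured'; then
    echo down
  else
    echo up
  fi
}
```

Usage on a live node would be `proc_drbd_down < /proc/drbd`.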
> 
> Doug
> 
> On Fri, 2007-04-20 at 11:35 -0400, Doug Knight wrote:
> 
> > In the interim I set the filesystem group to unmanaged so I could test
> > failing the drbd master/slave processes back and forth using the value
> > part of the place constraint. On my first attempt to switch nodes, it
> > basically took both drbd processes down, and they stayed down. When I
> > checked the logs on the node to which I was switching the drbd
> > primary, I found a message about a failed application of a diff. I
> > switched the place constraint back to the original node, then decided
> > to shut down heartbeat on the node where I was seeing the diff error;
> > now the shutdown is hung and the diff error below repeats every minute:
> > 
> > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.587 ->
> > 0.11.588 not applied to 0.11.593: current "num_updates" is greater than
> > required
> > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of
> > <diff > FAILED: Application of an update diff failed
> > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff
> > operation failed: Application of an update diff failed
> > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.588 ->
> > 0.11.589 not applied to 0.11.593: current "num_updates" is greater than
> > required
> > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of
> > <diff > FAILED: Application of an update diff failed
> > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff
> > operation failed: Application of an update diff failed
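For what it's worth, the warning reads like a version-tuple check: the local CIB is already at 0.11.593, past the 0.11.588 starting point the diff expects. A sketch of that comparison (the field order admin_epoch.epoch.num_updates is my assumption from the strings in the log, not taken from the CIB source):

```shell
#!/bin/sh
# Sketch: decide whether a diff starting at version $1 can be applied
# to a CIB currently at version $2. Versions look like "0.11.588".
diff_applies() {
  from="$1"; current="$2"
  from_prefix=${from%.*};  cur_prefix=${current%.*}   # e.g. "0.11"
  from_upd=${from##*.};    cur_upd=${current##*.}     # num_updates field
  if [ "$from_prefix" = "$cur_prefix" ] && [ "$cur_upd" -eq "$from_upd" ]; then
    echo applied
  elif [ "$from_prefix" = "$cur_prefix" ] && [ "$cur_upd" -gt "$from_upd" ]; then
    echo "rejected: current num_updates is greater than required"
  else
    echo "rejected: versions diverged"
  fi
}
```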
> > 
> > 
> > My boss and I are getting rather frustrated trying to get this setup
> > to work. Is there something obvious I'm missing? Has anyone gotten HA
> > 2.0.8, using v2 monitoring and the drbd OCF script, working with drbd
> > version 8.0.1 in a two-node cluster? I'm concerned because of the
> > comment made earlier by Bernhard.
> > 
> > Doug
> > 
> > On Fri, 2007-04-20 at 10:55 -0400, Doug Knight wrote:
> > 
> > > I changed the constraints to point to the master_slave ID, and voila,
> > > even without the Filesystem resource running, the drbd resource
> > > recognized the place constraint and the GUI now indicates master running
> > > where I expected it to. One down, one to go. Now, just to be sure, here's
> > > the modified group XML with the notify nvpair added:
> > > 
> > > <group ordered="true" collocated="true" id="grp_pgsql_mirror">
> > >    <primitive class="ocf" type="Filesystem" provider="heartbeat"
> > > id="fs_mirror">
> > >      <instance_attributes id="fs_mirror_instance_attrs">
> > >        <attributes>
> > >          <nvpair id="fs_mirror_device" name="device"
> > > value="/dev/drbd0"/>
> > >          <nvpair id="fs_mirror_directory" name="directory"
> > > value="/mirror"/>
> > >          <nvpair id="fs_mirror_fstype" name="fstype" value="ext3"/>
> > >          <nvpair id="fs_notify" name="notify" value="true"/>
> > >        </attributes>
> > >      </instance_attributes>
> > >    </primitive>
> > >    <instance_attributes id="grp_pgsql_mirror_instance_attrs">
> > >      <attributes/>
> > >    </instance_attributes>
> > >  </group>
> > > 
> > > I wanted to confirm I put it in the right place, as there was an
> > > instance_attributes tag both for the primitive resource within the
> > > group and for the group itself. I put it in the resource tag, per your
> > > statement below; is that correct?
> > > 
> > > Doug
> > > 
> > > On Fri, 2007-04-20 at 16:06 +0200, Andrew Beekhof wrote:
> > > 
> > > > On 4/20/07, Knight, Doug <[EMAIL PROTECTED]> wrote:
> > > > > OK, here's what happened. The drbd resources were both successfully
> > > > > running in Secondary mode on both servers, and both partitions were
> > > > > synched. My Filesystem resource was stopped, with the colocation,
> > > > > order, and place constraints in place. When I started the Filesystem
> > > > > resource, which is part of a group, it triggered the appropriate drbd
> > > > > slave to promote to master and transition to Primary. However, the
> > > > > Filesystem resource did not complete or mount the partition, which I
> > > > > believe is because notify is not enabled on it. A manual cleanup
> > > > > finally got it to start and mount, following all of the constraints
> > > > > I had defined. Next, I tried putting the server which was drbd
> > > > > primary into Standby state, which caused all kinds of problems (hung
> > > > > process, hung GUI, heartbeat shutdown wouldn't complete, etc.). I
> > > > > finally had to restart heartbeat on the server I was trying to send
> > > > > into Standby state (note that this node was also the DC at the
> > > > > time). So, I'm back up to where I have drbd in slave/slave,
> > > > > secondary/secondary mode, and filesystem stopped.
> > > > >
> > > > > I wanted to add notify="true" to either the filesystem resource itself
> > > > > or to its group, but the DTD does not define notify for groups (even
> > > > > though for some reason the GUI thinks you CAN define the notify
> > > > > attribute). I plan on eventually adding an IPaddr and a pgsql
> > > > > resource to this group. So I have two questions: 1) Where does it
> > > > > make more sense to add notify, at the group level or for the
> > > > > individual resource; and 2) Should the DTD define notify as an
> > > > > attribute of groups?
> > > > 
> > > > add it as a resource attribute
> > > > 
> > > >      <group ...>
> > > >         <instance_attributes id="...">
> > > >           <attributes>
> > > >             <nvpair id="..." name="notify" value="true"/>
> > > >           </attributes>
> > > >         </instance_attributes>
> > > >      </group>
> > > > _______________________________________________
> > > > Linux-HA mailing list
> > > > [email protected]
> > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > See also: http://linux-ha.org/ReportingProblems
> > > > 