Re: [Linux-HA] Cannot create group containing drbd using HB GUI

Doug Knight Tue, 24 Apr 2007 07:01:02 -0700

I settled on using cibadmin for now to change target_role. Now, when I
add the Filesystem resource, and try to start it up, it won't start. I'm
getting the following in my debug log:


pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:0_monitor_0) for rsc_drbd_7788:0 on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:1_monitor_0) for rsc_drbd_7788:1 on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (fs_mirror_start_0) for fs_mirror on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Handling failed start for fs_mirror on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 info: determine_online_status: Node arc-tkincaidlx.wsicorp.com is online
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:0_monitor_0) for rsc_drbd_7788:0 on arc-tkincaidlx.wsicorp.com
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:1_monitor_0) for rsc_drbd_7788:1 on arc-tkincaidlx.wsicorp.com
pengine[24246]: 2007/04/24_09:40:38 info: clone_print: Master/Slave Set: ms_drbd_7788
pengine[24246]: 2007/04/24_09:40:38 info: native_print: rsc_drbd_7788:0     (heartbeat::ocf:drbd):  Master arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 info: native_print: rsc_drbd_7788:1     (heartbeat::ocf:drbd):  Slave arc-tkincaidlx.wsicorp.com
pengine[24246]: 2007/04/24_09:40:38 info: group_print: Resource Group: grp_pgsql_mirror
pengine[24246]: 2007/04/24_09:40:38 info: native_print:     fs_mirror (heartbeat::ocf:Filesystem):    Stopped
pengine[24246]: 2007/04/24_09:40:38 info: master_color: Promoting rsc_drbd_7788:0
pengine[24246]: 2007/04/24_09:40:38 info: master_color: Promoted 1 instances of a possible 1 to master
pengine[24246]: 2007/04/24_09:40:38 notice: NoRoleChange: Leave resource rsc_drbd_7788:0        (arc-dknightlx)
pengine[24246]: 2007/04/24_09:40:38 notice: NoRoleChange: Leave resource rsc_drbd_7788:1        (arc-tkincaidlx.wsicorp.com)
pengine[24246]: 2007/04/24_09:40:38 notice: NoRoleChange: Leave resource rsc_drbd_7788:0        (arc-dknightlx)
pengine[24246]: 2007/04/24_09:40:38 notice: NoRoleChange: Leave resource rsc_drbd_7788:1        (arc-tkincaidlx.wsicorp.com)
pengine[24246]: 2007/04/24_09:40:38 info: master_color: Promoted 1 instances of a possible 1 to master
pengine[24246]: 2007/04/24_09:40:38 info: master_color: Promoted 1 instances of a possible 1 to master
pengine[24246]: 2007/04/24_09:40:38 WARN: native_color: Resource fs_mirror cannot run anywhere

No ocf_log debug output is getting triggered from the drbd ocf script,
yet HA is saying something about "Processing failed op" on what looks
like a monitor command. Can anyone tell me what this means? Also, I have
not explicitly defined operations on any of my resources yet; could that
be part of the problem?

Thanks,
Doug
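
[Editor's sketch, not part of the original message.] To see at a glance
which operations the policy engine is replaying as failed, the
"Processing failed op" lines in the log above can be grouped by node.
The following is a minimal Python sketch under stated assumptions: the
regular expression is inferred only from the log lines shown in this
message, the function name failed_ops_by_node is hypothetical, and the
sample text is a subset of the log with wrapped lines joined. It says
nothing about why the operations failed.

```python
import re
from collections import defaultdict

# Sample lines taken from the pengine log in this message, with the
# mail client's line wrapping undone (one log entry per line).
LOG = """\
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:0_monitor_0) for rsc_drbd_7788:0 on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (fs_mirror_start_0) for fs_mirror on arc-dknightlx
pengine[24246]: 2007/04/24_09:40:38 WARN: unpack_rsc_op: Processing failed op (rsc_drbd_7788:1_monitor_0) for rsc_drbd_7788:1 on arc-tkincaidlx.wsicorp.com
pengine[24246]: 2007/04/24_09:40:38 WARN: native_color: Resource fs_mirror cannot run anywhere
"""

# Assumed line shape, inferred from the sample above:
#   Processing failed op (<op id>) for <resource> on <node>
FAILED_OP = re.compile(
    r"Processing failed op \((?P<op>\S+)\) for (?P<rsc>\S+) on (?P<node>\S+)"
)


def failed_ops_by_node(log_text):
    """Return {node: [op_id, ...]} for each 'Processing failed op' line."""
    result = defaultdict(list)
    for line in log_text.splitlines():
        match = FAILED_OP.search(line)
        if match:
            result[match.group("node")].append(match.group("op"))
    return dict(result)


if __name__ == "__main__":
    for node, ops in failed_ops_by_node(LOG).items():
        print(node, ops)
```

Grouping by node makes it easier to tell apart the monitor_0 entries,
which appear on both nodes, from the fs_mirror start failure, which
appears only on arc-dknightlx.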


On Tue, 2007-04-24 at 09:21 -0400, Doug Knight wrote:

> I can't seem to find any documentation on crm_attribute, other than the
> --help. Below is my group section under resources defined in my cib. How
> would I use crm_attribute to query for and/or modify target_role?
> 
>        <group ordered="true" collocated="true" id="grp_pgsql_mirror">
>          <primitive class="ocf" type="Filesystem" provider="heartbeat" id="fs_mirror">
>            <instance_attributes id="fs_mirror_instance_attrs">
>              <attributes>
>                <nvpair id="fs_mirror_device" name="device" value="/dev/drbd0"/>
>                <nvpair id="fs_mirror_directory" name="directory" value="/mirror"/>
>                <nvpair id="fs_mirror_fstype" name="fstype" value="ext3"/>
>                <nvpair id="fs_notify" name="notify" value="true"/>
>              </attributes>
>            </instance_attributes>
>          </primitive>
>          <instance_attributes id="grp_pgsql_mirror_instance_attrs">
>            <attributes>
>              <nvpair id="grp_target_role" name="target_role" value="stopped"/>
>            </attributes>
>          </instance_attributes>
>        </group>
> 
> Thanks,
> Doug
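
[Editor's sketch, not part of the original thread.] The thread describes
keeping two XML files, one setting target_role to stopped and one to
started, and applying them with cibadmin. A small Python sketch of
generating such a fragment is shown here. The ids are copied from the
group definition quoted above; the function name target_role_fragment
is hypothetical, restricting the value to "started" or "stopped" is an
assumption based on the only two values the thread mentions, and
whether cibadmin accepts a partial group fragment like this for an
update is also an assumption to verify against your heartbeat version.

```python
import xml.etree.ElementTree as ET


def target_role_fragment(group_id, attrs_id, nvpair_id, role):
    """Build a <group> fragment carrying only the target_role nvpair.

    The element nesting mirrors the group definition quoted in the
    thread: group > instance_attributes > attributes > nvpair.
    """
    # Assumption: only the two values mentioned in the thread are allowed.
    if role not in ("started", "stopped"):
        raise ValueError("role must be 'started' or 'stopped'")
    group = ET.Element("group", {"id": group_id})
    inst = ET.SubElement(group, "instance_attributes", {"id": attrs_id})
    attributes = ET.SubElement(inst, "attributes")
    ET.SubElement(
        attributes,
        "nvpair",
        {"id": nvpair_id, "name": "target_role", "value": role},
    )
    return ET.tostring(group, encoding="unicode")


if __name__ == "__main__":
    # Print one fragment per role; each could be saved to its own file.
    for role in ("stopped", "started"):
        print(target_role_fragment(
            "grp_pgsql_mirror",
            "grp_pgsql_mirror_instance_attrs",
            "grp_target_role",
            role,
        ))
```

Generating both fragments from one function keeps the ids identical
between the "stopped" and "started" files, so the two differ only in
the value attribute.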
> On Tue, 2007-04-24 at 12:04 +0200, Dejan Muhamedagic wrote:
> 
> > On Mon, Apr 23, 2007 at 03:52:26PM -0400, Knight, Doug wrote:
> > > OK, unstuck, and moving forward with a patch from the DRBD email list...
> > > I've got drbd configured in a fairly reliable Master/Slave setup, and I
> > > can fail it back and forth between nodes using cibadmin and xml that
> > > changes the Place constraint from node to node. (Not sure what this
> > > means, but when the drbd processes first come up, the GUI indicates one
> > > as Master, but does not show the other as Slave, only that it is
> > > running. When I change the Place constraint, Master moves from one node
> > > to the other, then the formerly Master node indicates Slave. From that
> > > point on behavior is as expected.) Now, I've created a group containing
> > > only a single Filesystem resource, colocated to the drbd master (based
> > > on the previously discussed constraint rules of a -infinity for existing
> > > on a stopped or slave drbd node), ordered to come up after the drbd
> > > master. I'm using target_role to control whether HA starts it or not
> > > (one xml sets target_role to stopped, the other started). First
> > > > question: What is the best way to start and stop resources without
> > > > using the GUI (in other words, is my use of target_role a good way to
> > > > control resources)?
> > 
> > target_role=stopped is the right way. crm_attribute should do.
> > 
> > > Second question: Does it make more sense to have
> > > target_role defined in the group instance_attributes or in the
> > > instance_attributes within the individual primitive resource?
> > 
> > Whichever way you want it. It should work OK for the group, so if
> > you want to stop the whole group...
> > 
> > > 
> > > Thanks,
> > > Doug
> > > 
> > > On Fri, 2007-04-20 at 14:46 -0400, Doug Knight wrote:
> > > 
> > > > Well, whatever was stuck, I had to do a rmmod to remove the drbd module
> > > > from the kernel, then modprobe it back in, and the "stuck" Secondary
> > > > indication went away.
> > > > 
> > > > Doug
> > > > 
> > > > On Fri, 2007-04-20 at 14:30 -0400, Doug Knight wrote:
> > > > 
> > > > > > I completely shut down heartbeat on both nodes, cleared out the backup
> > > > > > cib.xml files, recopied the cib.xml from the primary node to the
> > > > > > secondary, then brought everything back up. This cleared the "diff"
> > > > > > error. The drbd master/slave pair came up as expected, but when I tried
> > > > > > to stop them, they eventually went into an unmanaged state. Looking at
> > > > > > the logs and comparing to the stop function in the OCF script, I noticed
> > > > > > that I was seeing a successful "drbdadm down", but the additional check
> > > > > > for status after the down was indicating that the down was unsuccessful
> > > > > > (from checking drbdadm state). Further, I manually verified that indeed
> > > > > > the drbd processes were down, and executed the following:
> > > > > 
> > > > > [EMAIL PROTECTED] xml]# /sbin/drbdadm -c /etc/drbd.conf state pgsql
> > > > > Secondary/Unknown
> > > > > [EMAIL PROTECTED] xml]# cat /proc/drbd
> > > > > version: 8.0.1 (api:86/proto:86)
> > > > > SVN Revision: 2784 build by [EMAIL PROTECTED], 2007-04-09 11:30:31
> > > > >  0: cs:Unconfigured
> > > > > 
> > > > > > It's the same output on either node, and drbd is definitely down on both
> > > > > > nodes. So, /proc/drbd correctly indicates drbd is down, but the
> > > > > > subsequent check using drbdadm state comes back indicating one side is
> > > > > > up in Secondary mode, which it's not. This is why the resource is now in
> > > > > > unmanaged mode. Any ideas why the two tools would differ?
> > > > > 
> > > > > Doug
> > > > > 
> > > > > On Fri, 2007-04-20 at 11:35 -0400, Doug Knight wrote:
> > > > > 
> > > > > > > In the interim I set the filesystem group to unmanaged to test failing
> > > > > > > the drbd master/slave processes back and forth, using the value part
> > > > > > > of the place constraint. On my first attempt to switch nodes, it
> > > > > > > basically took both drbd processes down, and they stayed down. When I
> > > > > > > checked the logs on the node to which I was switching the primary drbd, I
> > > > > > > found a message about a failed application diff. I switched the place
> > > > > > > constraint back to the original node. I decided to shut down heartbeat on
> > > > > > > the node where I was seeing the diff error; now the shutdown is hung and
> > > > > > > the diff error below is repeating every minute:
> > > > > > 
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.587 -> 0.11.588 not applied to 0.11.593: current "num_updates" is greater than required
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_diff: Diff 0.11.588 -> 0.11.589 not applied to 0.11.593: current "num_updates" is greater than required
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed
> > > > > > > cib[3040]: 2007/04/20_11:24:52 WARN: cib_process_request: cib_apply_diff operation failed: Application of an update diff failed
> > > > > > 
> > > > > > 
> > > > > > I (and my boss) are kind of getting frustrated getting this setup to
> > > > > > work. Is there something obvious I'm missing? Has anyone ever had HA
> > > > > > > 2.0.8, using v2 monitoring and drbd ocf script, and drbd version 8.0.1
> > > > > > > working in a two node cluster? I'm concerned because of the comment made
> > > > > > > earlier by Bernhard.
> > > > > > 
> > > > > > Doug
> > > > > > 
> > > > > > On Fri, 2007-04-20 at 10:55 -0400, Doug Knight wrote:
> > > > > > 
> > > > > > > > I changed the constraints to point to the master_slave ID, and voila,
> > > > > > > > even without the Filesystem resource running, the drbd resource
> > > > > > > > recognized the place constraint and the GUI now indicates master running
> > > > > > > > where I expected it to. One down, one to go. Now, just to be sure, here's
> > > > > > > > the modified group XML with the notify nvpair added:
> > > > > > > 
> > > > > > > > <group ordered="true" collocated="true" id="grp_pgsql_mirror">
> > > > > > > >    <primitive class="ocf" type="Filesystem" provider="heartbeat" id="fs_mirror">
> > > > > > > >      <instance_attributes id="fs_mirror_instance_attrs">
> > > > > > > >        <attributes>
> > > > > > > >          <nvpair id="fs_mirror_device" name="device" value="/dev/drbd0"/>
> > > > > > > >          <nvpair id="fs_mirror_directory" name="directory" value="/mirror"/>
> > > > > > > >          <nvpair id="fs_mirror_fstype" name="fstype" value="ext3"/>
> > > > > > > >          <nvpair id="fs_notify" name="notify" value="true"/>
> > > > > > > >        </attributes>
> > > > > > > >      </instance_attributes>
> > > > > > > >    </primitive>
> > > > > > > >    <instance_attributes id="grp_pgsql_mirror_instance_attrs">
> > > > > > > >      <attributes/>
> > > > > > > >    </instance_attributes>
> > > > > > > >  </group>
> > > > > > > 
> > > > > > > > I wanted to confirm I put it in the right place, as there was an
> > > > > > > > instance_attributes tag for both the primitive resource within the
> > > > > > > > group, and for the group itself. I put it in the resource tag, per your
> > > > > > > > statement below; is that correct?
> > > > > > > 
> > > > > > > Doug
> > > > > > > 
> > > > > > > On Fri, 2007-04-20 at 16:06 +0200, Andrew Beekhof wrote:
> > > > > > > 
> > > > > > > > On 4/20/07, Knight, Doug <[EMAIL PROTECTED]> wrote:
> > > > > > > > > > OK, here's what happened. The drbd resources were both successfully
> > > > > > > > > > running in Secondary mode on both servers, and both partitions were
> > > > > > > > > > synched. My Filesystem resource was stopped, with the colocation, order,
> > > > > > > > > > and place constraints in place. When I started the Filesystem resource,
> > > > > > > > > > which is part of a group, it triggered the appropriate drbd slave to
> > > > > > > > > > promote to master and transition to Primary. However, the Filesystem
> > > > > > > > > > resource did not complete or mount the partition, which I believe is
> > > > > > > > > > because Notify is not enabled on it. A manual cleanup finally got it to
> > > > > > > > > > start and mount, following all of the constraints I had defined. Next, I
> > > > > > > > > > tried putting the server which was drbd primary into Standby state,
> > > > > > > > > > which caused all kinds of problems (hung process, hung GUI, heartbeat
> > > > > > > > > > shutdown wouldn't complete, etc). I finally had to restart heartbeat on
> > > > > > > > > > the server I was trying to send into Standby state (note that this node
> > > > > > > > > > was also the DC at the time). So, I'm back up to where I have drbd in
> > > > > > > > > > slave/slave, secondary/secondary mode, and filesystem stopped.
> > > > > > > > >
> > > > > > > > > > I wanted to add notify="true" to either the filesystem resource itself
> > > > > > > > > > or to its group, but the DTD does not define notify for groups (even
> > > > > > > > > > though for some reason the GUI thinks you CAN define the notify
> > > > > > > > > > attribute). I plan on eventually adding an IPaddr and a pgsql resource
> > > > > > > > > > to this group. So I have two questions: 1) Where does it make more sense
> > > > > > > > > > to add notify, at the group level or for the individual resource; and 2)
> > > > > > > > > > Should the DTD define notify as an attribute of groups?
> > > > > > > > 
> > > > > > > > add it as a resource attribute
> > > > > > > > 
> > > > > > > >      <group ...>
> > > > > > > >         <instance_attributes id="...">
> > > > > > > >           <attributes>
> > > > > > > >             <nvpair id="..." name="notify" value="true"/>
> > > > > > > > _______________________________________________
> > > > > > > > Linux-HA mailing list
> > > > > > > > [email protected]
> > > > > > > > http://lists.linux-ha.org/mailman/listinfo/linux-ha
> > > > > > > > See also: http://linux-ha.org/ReportingProblems
> > > > > > > > 