On 06/08/2013, at 5:27 PM, "Ulrich Windl" <[email protected]> wrote:
> Hi!
>
> I always wanted to know what "Detected action XXX from a different transition
> ..." really means:
> Does it indicate a programming error in the cluster stack? To me it sounds as
> if at least two parties try to control a thing without agreeing what is to be
> done...

It just means "something unexpected happened", probably a recurring monitor that detected a failure.
That string no longer appears in more recent versions.

>
> Regards,
> Ulrich
>
> >>> Mark Nipper <[email protected]> wrote on 05.08.2013 at 19:34 in
> message
> <[email protected]>:
>> One of our DRBD clusters has 47 LUN's being published.
>> We're using RHEL 6.4. Here are the various package versions
>> being used:
>> ---
>> pacemaker-1.1.7-6.el6.x86_64
>> corosync-1.4.1-7.el6.x86_64
>> resource-agents-3.9.2-12.el6.x86_64
>> scsi-target-utils-1.0.24-2.el6.x86_64
>>
>> Somewhere after 40 LUN's we started experiencing monitor
>> failures of the most recent LUN's added to the cluster. Things
>> like:
>> ---
>> Jul 26 23:47:39 [8557] stor01a crmd: info: process_lrm_event: LRM operation lun47_monitor_10000 (call=357, rc=7, cib-update=6790, confirmed=false) not running
>> Jul 26 23:47:39 [8557] stor01a crmd: info: process_graph_event: Detected action lun47_monitor_10000 from a different transition: 5737 vs. 5793
>> Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph: process_graph_event:476 - Triggered transition abort (complete=1, tag=lrm_rsc_op, id=lun47_last_failure_0, magic=0:7;192:5737:0:e16c8e9d-87ed-4132-a3b2-724a30b6cc73, cib=0.111.47) : Old event
>> Jul 26 23:47:39 [8557] stor01a crmd: warning: update_failcount: Updating failcount for lun47 on stor01a after failed monitor: rc=7 (update=value++, time=1374900459)
>> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: fail-count-lun47 (1)
>> Jul 26 23:47:39 [8557] stor01a crmd: notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL origin=abort_transition_graph ]
>> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update: Sent update 438: fail-count-lun47=1
>> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_trigger_update: Sending flush op to all hosts for: last-failure-lun47 (1374900459)
>> Jul 26 23:47:39 [8557] stor01a crmd: info: abort_transition_graph: te_update_diff:176 - Triggered transition abort (complete=1, tag=nvpair, id=status-stor01a-fail-count-lun47, name=fail-count-lun47, value=1, magic=NA, cib=0.111.48) : Transient attribute: update
>> Jul 26 23:47:39 [8555] stor01a attrd: notice: attrd_perform_update: Sent update 441: last-failure-lun47=1374900459
>> ---
>>
>> So I decided to modify the resource agent as follows:
>> ---
>> --- iSCSILogicalUnit.orig 2013-08-05 12:15:03.185879119 -0500
>> +++ iSCSILogicalUnit 2013-08-01 11:31:24.768133374 -0500
>> @@ -305,12 +305,28 @@
>>          if [ -z "$TID" ]; then
>>              # Our target is not configured, thus we're not
>>              # running.
>> +            echo "$(date) TID not found: ${TID}." >> /var/log/iscsi-ra.log
>>              return $OCF_NOT_RUNNING
>>          fi
>>          # This only looks for the backing store, but does not test
>>          # for the correct target ID and LUN.
>> -        tgtadm --lld iscsi --op show --mode target \
>> +        tgt_output=$(tgtadm --lld iscsi --op show --mode target)
>> +        echo "$tgt_output" \
>>              | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
>> +        echo "$(date) first LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
>> +        echo "$tgt_output" >> /var/log/iscsi-ra.log
>> +        sleep 1
>> +        tgt_output=$(tgtadm --lld iscsi --op show --mode target)
>> +        echo "$tgt_output" \
>> +            | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
>> +        echo "$(date) second LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
>> +        echo "$tgt_output" >> /var/log/iscsi-ra.log
>> +        sleep 1
>> +        tgt_output=$(tgtadm --lld iscsi --op show --mode target)
>> +        echo "$tgt_output" \
>> +            | grep -E -q "[[:space:]]+Backing store.*: ${OCF_RESKEY_path}" && return $OCF_SUCCESS
>> +        echo "$(date) third LUN failure: ${OCF_RESKEY_path}" >> /var/log/iscsi-ra.log
>> +        echo "$tgt_output" >> /var/log/iscsi-ra.log
>>          ;;
>>      lio)
>>          configfs_path="/sys/kernel/config/target/iscsi/${OCF_RESKEY_target_iqn}/tpgt_1/lun/lun_${OCF_RESKEY_lun}/${OCF_RESOURCE_INSTANCE}/udev_path"
>> ---
>>
>> And over the weekend I got a hit from this. But it only
>> failed the first time.
The output from iscsi-ra.log: >> --- >> Sun Aug 4 10:54:41 CDT 2013 first LUN failure: /dev/stor01/vm-www01 >> Target 1: iqn.2013-04.net.bitgnome:vh-storage01 >> System information: >> Driver: iscsi >> State: ready >> I_T nexus information: >> I_T nexus: 17 >> Initiator: iqn.1994-05.com.redhat:b8998f3aaa11 >> Connection: 0 >> IP Address: 172.16.165.18 >> I_T nexus: 18 >> Initiator: iqn.1994-05.com.redhat:36ad8852a96d >> Connection: 0 >> IP Address: 172.16.165.19 >> I_T nexus: 19 >> Initiator: iqn.1994-05.com.redhat:28d6b194ab >> Connection: 0 >> IP Address: 172.16.165.20 >> I_T nexus: 20 >> Initiator: iqn.1994-05.com.redhat:bc9afc47c4 >> Connection: 0 >> IP Address: 172.16.165.21 >> LUN information: >> LUN: 0 >> Type: controller >> SCSI ID: IET 00010000 >> SCSI SN: beaf10 >> Size: 0 MB, Block size: 1 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: null >> Backing store path: None >> Backing store flags: >> LUN: 1 >> Type: disk >> SCSI ID: lun1 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ldap1 >> Backing store flags: >> LUN: 2 >> Type: disk >> SCSI ID: lun2 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-arcgis >> Backing store flags: >> LUN: 3 >> Type: disk >> SCSI ID: lun3 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-mail1 >> Backing store flags: >> LUN: 4 >> Type: disk >> SCSI ID: lun4 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store 
path: /dev/stor01/vm-mail2 >> Backing store flags: >> LUN: 5 >> Type: disk >> SCSI ID: lun5 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-wp2 >> Backing store flags: >> LUN: 6 >> Type: disk >> SCSI ID: lun6 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ldap-slave1 >> Backing store flags: >> LUN: 7 >> Type: disk >> SCSI ID: lun7 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ldap-slave2 >> Backing store flags: >> LUN: 8 >> Type: disk >> SCSI ID: lun8 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ldap-slave3 >> Backing store flags: >> LUN: 9 >> Type: disk >> SCSI ID: lun9 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-wp1 >> Backing store flags: >> LUN: 10 >> Type: disk >> SCSI ID: lun10 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-support >> Backing store flags: >> LUN: 11 >> Type: disk >> SCSI ID: lun11 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-cache1 >> Backing store flags: >> LUN: 12 >> Type: disk >> SCSI ID: 
lun12 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-cache2 >> Backing store flags: >> LUN: 13 >> Type: disk >> SCSI ID: lun13 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-proxy >> Backing store flags: >> LUN: 14 >> Type: disk >> SCSI ID: lun14 >> SCSI SN: (stdin)= >> Size: 53687 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-pcspine >> Backing store flags: >> LUN: 15 >> Type: disk >> SCSI ID: lun15 >> SCSI SN: (stdin)= >> Size: 53687 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-print >> Backing store flags: >> LUN: 16 >> Type: disk >> SCSI ID: lun16 >> SCSI SN: (stdin)= >> Size: 53687 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ad >> Backing store flags: >> LUN: 17 >> Type: disk >> SCSI ID: lun17 >> SCSI SN: (stdin)= >> Size: 53687 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-pcbrain >> Backing store flags: >> LUN: 18 >> Type: disk >> SCSI ID: lun18 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-xmpp >> Backing store flags: >> LUN: 19 >> Type: disk >> SCSI ID: lun19 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable 
media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-pma >> Backing store flags: >> LUN: 20 >> Type: disk >> SCSI ID: lun20 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-cake >> Backing store flags: >> LUN: 21 >> Type: disk >> SCSI ID: lun21 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-ica-file >> Backing store flags: >> LUN: 22 >> Type: disk >> SCSI ID: lun22 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-liwc >> Backing store flags: >> LUN: 23 >> Type: disk >> SCSI ID: lun23 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-lasso >> Backing store flags: >> LUN: 24 >> Type: disk >> SCSI ID: lun24 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-qt >> Backing store flags: >> LUN: 25 >> Type: disk >> SCSI ID: lun25 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-public >> Backing store flags: >> LUN: 26 >> Type: disk >> SCSI ID: lun26 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store 
path: /dev/stor01/vm-source >> Backing store flags: >> LUN: 27 >> Type: disk >> SCSI ID: lun27 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-gmc >> Backing store flags: >> LUN: 28 >> Type: disk >> SCSI ID: lun28 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-solr >> Backing store flags: >> LUN: 29 >> Type: disk >> SCSI ID: lun29 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-license >> Backing store flags: >> LUN: 30 >> Type: disk >> SCSI ID: lun30 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-media >> Backing store flags: >> LUN: 31 >> Type: disk >> SCSI ID: lun31 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-opera >> Backing store flags: >> LUN: 32 >> Type: disk >> SCSI ID: lun32 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-asl >> Backing store flags: >> LUN: 33 >> Type: disk >> SCSI ID: lun33 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-daseupload >> Backing store flags: >> LUN: 34 >> Type: disk >> SCSI ID: 
lun34 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-arcsde >> Backing store flags: >> LUN: 35 >> Type: disk >> SCSI ID: lun35 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-switchwitch >> Backing store flags: >> LUN: 36 >> Type: disk >> SCSI ID: lun36 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-matlab >> Backing store flags: >> LUN: 37 >> Type: disk >> SCSI ID: lun37 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-spintx >> Backing store flags: >> LUN: 38 >> Type: disk >> SCSI ID: lun38 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-atlassian >> Backing store flags: >> LUN: 39 >> Type: disk >> SCSI ID: lun39 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-test3 >> Backing store flags: >> LUN: 40 >> Type: disk >> SCSI ID: lun40 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-nfs >> Backing store flags: >> LUN: 41 >> Type: disk >> SCSI ID: lun41 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> 
Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-test4 >> Backing store flags: >> LUN: 42 >> Type: disk >> SCSI ID: lun42 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-bamboo >> Backing store flags: >> LUN: 43 >> Type: disk >> SCSI ID: lun43 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-wowza-test >> Backing store flags: >> LUN: 44 >> Type: disk >> SCSI ID: lun44 >> SCSI SN: (stdin)= >> Size: 53687 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-abman-dev >> Backing store flags: >> LUN: 45 >> Type: disk >> SCSI ID: lun45 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-workflow >> Backing store flags: >> LUN: 46 >> Type: disk >> SCSI ID: lun46 >> SCSI SN: (stdin)= >> Size: 10737 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent removal: No >> Readonly: No >> Backing store type: rdwr >> Backing store path: /dev/stor01/vm-psyimage >> Backing store flags: >> LUN: 47 >> Type: disk >> SCSI ID: lun47 >> SCSI SN: (stdin)= >> Size: 21475 MB, Block size: 512 >> Online: Yes >> Removable media: No >> Prevent >> --- >> >> So it clearly got incomplete output from tgtadm the first >> time and successfully retrieved all the information the second >> time before it returned a return code of 7. 
I found where tgtd
>> would crash with more than 40 LUN's being discussed back in 2008:
>> ---
>> http://lists.wpkg.org/pipermail/stgt/2008-December/002528.html
>>
>> But I couldn't find anything else related to this problem
>> specifically.
>>
>> Has anyone else seen weirdness like this from tgtd? I
>> assume the "easy" answer is switch to a newer distribution with
>> LIO. Or just keep the multiple checks in place to work around the
>> problem.
>>
>> --
>> Mark Nipper
>> [email protected] (XMPP)
>> +1 979 575 3193
>> -
>> In theory there is no difference between theory and practice. In
>> practice there is.
>> _______________________________________________
>> Linux-HA mailing list
>> [email protected]
>> http://lists.linux-ha.org/mailman/listinfo/linux-ha
>> See also: http://linux-ha.org/ReportingProblems
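
If the retry workaround stays in place, the three copy-pasted checks in the patch quoted above could probably be folded into one loop. What follows is only a minimal sketch, not the shipped iSCSILogicalUnit agent: the function name lun_backing_store_present and the variables TGT_SHOW_CMD and MAX_TRIES are hypothetical names made up for illustration. The tgtadm invocation, the grep pattern and the two return codes are the ones from the patch.

```shell
#!/bin/sh
# Sketch only. lun_backing_store_present, TGT_SHOW_CMD and MAX_TRIES are
# hypothetical names, not part of the real resource agent.

# OCF return codes as used in the quoted patch (0 = running, 7 = not running).
OCF_SUCCESS=0
OCF_NOT_RUNNING=7

# Command that prints the target listing; kept overridable so the loop
# can be exercised on a machine without tgtd.
: "${TGT_SHOW_CMD:=tgtadm --lld iscsi --op show --mode target}"
# How many times to ask before giving up (assumption: 3, as in the patch).
: "${MAX_TRIES:=3}"

lun_backing_store_present() {
    path=$1
    try=1
    while [ "$try" -le "$MAX_TRIES" ]; do
        # Capture the whole listing first, then search it, so that a
        # truncated listing could also be logged for later inspection.
        tgt_output=$($TGT_SHOW_CMD 2>/dev/null)
        if printf '%s\n' "$tgt_output" \
            | grep -E -q "[[:space:]]+Backing store.*: ${path}"; then
            return $OCF_SUCCESS
        fi
        try=$((try + 1))
        # Pause between attempts, but not after the last one.
        if [ "$try" -le "$MAX_TRIES" ]; then
            sleep 1
        fi
    done
    return $OCF_NOT_RUNNING
}
```

Inside the agent's monitor branch this would be called as `lun_backing_store_present "$OCF_RESKEY_path"`, with its exit status used as the monitor result. Note that this only hides the short listing from tgtadm; it does not explain why the output is cut off after roughly 40 LUNs.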
