Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-08-07 Thread Thomas Glanzmann
Hello Andrew,

 I can try and fix that if you re-run with -x and paste the output.

(apache-03) [~] crm_report -l /var/adm/syslog/2013/08/05 -f "2013-08-04 18:30:00" -t "2013-08-04 19:15" -x
+ shift
+ true
+ [ ! -z ]
+ break
+ [ x != x ]
+ [ x1375633800 != x ]
+ masterlog=
+ [ -z  ]
+ log WARNING: The tarball produced by this program may contain
+ printf %-10s  WARNING: The tarball produced by this program may contain\n 
apache-03:
apache-03:  WARNING: The tarball produced by this program may contain
+ log  sensitive information such as passwords.
+ printf %-10s   sensitive information such as passwords.\n apache-03:
apache-03:   sensitive information such as passwords.
+ log
+ printf %-10s  \n apache-03:
apache-03:
+ log We will attempt to remove such information if you use the
+ printf %-10s  We will attempt to remove such information if you use the\n 
apache-03:
apache-03:  We will attempt to remove such information if you use the
+ log -p option. For example: -p pass.* -p user.*
+ printf %-10s  -p option. For example: -p pass.* -p user.*\n apache-03:
apache-03:  -p option. For example: -p pass.* -p user.*
+ log
+ printf %-10s  \n apache-03:
apache-03:
+ log However, doing this may reduce the ability for the recipients
+ printf %-10s  However, doing this may reduce the ability for the recipients\n 
apache-03:
apache-03:  However, doing this may reduce the ability for the recipients
+ log to diagnose issues and generally provide assistance.
+ printf %-10s  to diagnose issues and generally provide assistance.\n 
apache-03:
apache-03:  to diagnose issues and generally provide assistance.
+ log
+ printf %-10s  \n apache-03:
apache-03:
+ log IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
+ printf %-10s  IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM 
EXPOSURE\n apache-03:
apache-03:  IT IS YOUR RESPONSIBILITY TO PROTECT SENSITIVE DATA FROM EXPOSURE
+ log
+ printf %-10s  \n apache-03:
apache-03:
+ [ -z  ]
+ getnodes any
+ [ -z any ]
+ cluster=any
+ [ -z ]
+ HA_STATE_DIR=/var/lib/heartbeat
+ find_cluster_cf any
+ warning Unknown cluster type: any
+ log WARN: Unknown cluster type: any
+ printf %-10s  WARN: Unknown cluster type: any\n apache-03:
apache-03:  WARN: Unknown cluster type: any
+ cluster_cf=
+ ps -ef
+ egrep -qs [c]ib
+ debug Querying CIB for nodes
+ [ 0 -gt 0 ]
+ cibadmin -Ql -o nodes
+ awk
  /type="normal"/ {
    for( i=1; i<=NF; i++ )
        if( $i~/^uname=/ ) {
            sub("uname=.","",$i);
            sub("\".*","",$i);
            print $i;
            next;
        }
  }

+ tr \n
+ nodes=apache-03 apache-04
+ log Calculated node list: apache-03 apache-04
+ printf %-10s  Calculated node list: apache-03 apache-04 \n apache-03:
apache-03:  Calculated node list: apache-03 apache-04
+ [ -z apache-03 apache-04  ]
+ echo apache-03 apache-04
+ grep -qs apache-03
+ debug We are a cluster node
+ [ 0 -gt 0 ]
+ [ -z 1375636500 ]
+ date +%a-%d-%b-%Y
+ label=pcmk-Wed-07-Aug-2013
+ time2str 1375633800
+ perl -e use POSIX; print strftime('%x %X',localtime(1375633800));
+ time2str 1375636500
+ perl -e use POSIX; print strftime('%x %X',localtime(1375636500));
+ log Collecting data from apache-03 apache-04  (08/04/13 18:30:00 to 08/04/13 
19:15:00)
+ printf %-10s  Collecting data from apache-03 apache-04  (08/04/13 18:30:00 to 
08/04/13 19:15:00)\n apache-03:
apache-03:  Collecting data from apache-03 apache-04  (08/04/13 18:30:00 to 
08/04/13 19:15:00)
+ collect_data pcmk-Wed-07-Aug-2013 1375633800 1375636500
+ label=pcmk-Wed-07-Aug-2013
+ expr 1375633800 - 10
+ start=1375633790
+ expr 1375636500 + 10
+ end=1375636510
+ masterlog=
+ [ x != x ]
+ l_base=/home/tg/pcmk-Wed-07-Aug-2013
+ r_base=pcmk-Wed-07-Aug-2013
+ [ -e /home/tg/pcmk-Wed-07-Aug-2013 ]
+ mkdir -p /home/tg/pcmk-Wed-07-Aug-2013
+ [ x != x ]
+ cat
+ [ apache-03 = apache-03 ]
+ cat
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common 
/usr/share/pacemaker/report.collector
+ bash /home/tg/pcmk-Wed-07-Aug-2013/collector
apache-03:  ERROR: Could not determine the location of your cluster logs, try 
specifying --logfile /some/path
+ cat
+ [ apache-03 = apache-04 ]
+ cat /home/tg/pcmk-Wed-07-Aug-2013/.env /usr/share/pacemaker/report.common 
/usr/share/pacemaker/report.collector
+ ssh -l root -T apache-04 -- mkdir -p pcmk-Wed-07-Aug-2013; cat > pcmk-Wed-07-Aug-2013/collector; bash pcmk-Wed-07-Aug-2013/collector
+ cd /home/tg/pcmk-Wed-07-Aug-2013
+ tar xf -
apache-04:  ERROR: Could not determine the location of your cluster logs, try 
specifying --logfile /some/path
tar: This does not look like a tar archive
tar: Exiting with failure status due to previous errors
+ analyze /home/tg/pcmk-Wed-07-Aug-2013
+ flist=hostcache members.txt cib.xml crm_mon.txt  logd.cf sysinfo.txt
+ printf Diff hostcache...
+ ls /home/tg/pcmk-Wed-07-Aug-2013/*/hostcache
+ echo no 

Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes

2013-08-07 Thread Thomas Glanzmann
Hello Andrew,

 As I said The cluster only stops doing this if writing to disk fails
 at some point - but there would have been an error in your logs if
 that were the case.

I grepped the logs and found out that there was a write error on 15
July, and probably none of the changes after that made it to disk.

(apache-03) [/var/adm/syslog/2013] grep 'Disk write failed' ??/??/*
07/15/daemon:Jul 15 17:55:04 apache-03 cib: [29394]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
07/15/daemon:Jul 15 17:55:04 172.19.0.2 cib: [23106]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug  4 19:03:55 apache-03 cib: [3226]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0
08/04/daemon:Aug  4 19:03:56 apache-04-intern cib: [3197]: ERROR: 
cib_diskwrite_complete: Disk write failed: status=134, signo=6, exitcode=0

And it looks like the reason for that was not a bad disk, but a failure
in another component:

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - cib admin_epoch=0 
epoch=19 num_updates=3 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   configuration 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - resources 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   group id=nfs 

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - primitive 
id=gcl_fs 
Jul 15 17:55:04 apache-03 crmd: [29398]: info: abort_transition_graph: 
te_update_diff:126 - Triggered transition abort (complete=1, tag=diff, 
id=(null), magic=NA, cib=0.20.1) : Non-status change
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   
meta_attributes id=gcl_fs-meta_attributes 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - nvpair 
id=gcl_fs-meta_attributes-target-role name=target-role value=Started 
__crm_diff_marker__=removed:top /
Jul 15 17:55:04 apache-03 crmd: [29398]: notice: do_state_transition: State 
transition S_IDLE - S_POLICY_ENGINE [ input=I_PE_CALC cause=C_FSA_INTERNAL 
origin=abort_transition_graph ]
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   
/meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /primitive
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - 
meta_attributes id=nfs-meta_attributes 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   nvpair 
value=Stopped id=nfs-meta_attributes-target-role /
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - 
/meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   /group
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: -   /configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: - /cib
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + cib epoch=20 
num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 
crm_feature_set=3.0.6 update-origin=apache-03 update-client=cibadmin 
cib-last-written=Mon Jul 15 16:02:23 2013 have-quorum=1 
dc-uuid=61e8f424-b538-4352-b3fe-955ca853e5fb 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   configuration 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + resources 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   group id=nfs 

Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + 
meta_attributes id=nfs-meta_attributes 
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   nvpair 
id=nfs-meta_attributes-target-role name=target-role value=Started /
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + 
/meta_attributes
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   /group
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /resources
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: +   /configuration
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib:diff: + /cib
Jul 15 17:55:04 apache-03 cib: [29394]: info: cib_process_request: Operation 
complete: op cib_replace for section resources (origin=local/cibadmin/2, 
version=0.20.1): ok (rc=0)
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: validate_cib_digest: Digest 
comparision failed: expected 976068d203615e656547fdf60190ad16 
(/var/lib/heartbeat/crm/cib.b9SItG), calculated 3f273f2cf3f97c0c02be83555ecabf0d
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Checksum of 
/var/lib/heartbeat/crm/cib.p3FraX failed!  Configuration contents ignored!
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: retrieveCib: Usually this is 
caused by manual changes, please refer to 
http://clusterlabs.org/wiki/FAQ#cib_changes_detected
Jul 15 17:55:04 apache-03 cib: [3583]: ERROR: crm_abort: write_cib_contents: 
Triggered fatal assert at io.c:662 : retrieveCib(tmp1, tmp2, FALSE) != NULL
Jul 15 17:55:04 apache-03 pengine: [29464]: notice: LogActions: Start   
nfs-common  

Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-08-07 Thread Thomas Glanzmann
Hello Andrew,

 It really helps to read the output of the commands you're running:

 Did you not see these messages the first time?

 apache-03:  WARN: Unknown cluster type: any
 apache-03:  ERROR: Could not determine the location of your cluster logs, try 
 specifying --logfile /some/path
 apache-04:  ERROR: Could not determine the location of your cluster logs, try 
 specifying --logfile /some/path

 Try adding -H and --logfile {somevalue} next time.

I'll do that and report back.
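
Something along these lines, I assume (the log path below is just my local
central-syslog layout, and I am taking the -H / --logfile suggestion at face
value):

crm_report -H --logfile /var/adm/syslog/2013/08/05/daemon \
    -f "2013-08-04 18:30:00" -t "2013-08-04 19:15"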

 An updated pacemaker is the important part. Whether you switch to
 corosync too is up to you.

I'll do that.

 Pacemaker+heartbeat is by far the least tested combination.

What is the best tested combination? Pacemaker and corosync? Any
specific version, or should I go with the latest release of both?

 Best to poke the debian maintainers

I'll do that as well.

 Do you mean See that the monitors _work, then_ take the system out of
 maintance mode...?  If so, then yes.

Yes, that is what I want to do. :-)
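
Roughly what I have in mind (just a sketch; crm_mon -1 -r is only there to
eyeball the monitor results before leaving maintenance mode):

crm configure property maintenance-mode=true
# ... do the actual maintenance work ...
crm_mon -1 -r    # check that all monitors still report the expected state
crm configure property maintenance-mode=false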

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes

2013-08-05 Thread Thomas Glanzmann
Hello Andrew,

 Any change to the configuration section is automatically written to
 disk.  The cluster only stops doing this if writing to disk fails at
 some point - but there would have been an error in your logs if that
 were the case.

then I do not get it. Yesterday, when the nodes suicided, I lost 24
hours of configuration. I looked in /var/lib/heartbeat/crm and there
was no XML file, even though I had changed the configuration many
times; three resource groups were gone:

apache-03-fencing   (stonith:external/ipmi):Started apache-04
apache-04-fencing   (stonith:external/ipmi):Started apache-03
 Resource Group: routing
 router_ipv4(ocf::heartbeat:IPaddr2):   Started apache-03
 router_ipv6(ocf::heartbeat:IPv6addr):  Started apache-03
 openvpn_ipv4   (ocf::heartbeat:IPaddr2):   Started apache-03
 router_ipv6_transfer   (ocf::heartbeat:IPv6addr):  Started 
apache-03
 openvpn_glanzmann  (ocf::heartbeat:openvpn):   Started apache-03
 openvpn_ipxechange (ocf::heartbeat:openvpn):   Started apache-03
 openvpn_eclogic(ocf::heartbeat:openvpn):   Started apache-03
 openvpn_einwahl(ocf::heartbeat:openvpn):   Started apache-03
 Resource Group: nfs
 gcl_fs (ocf::heartbeat:Filesystem):Started apache-04
 nfs-common (ocf::heartbeat:nfs-common):Started apache-04
 nfs-kernel-server  (ocf::heartbeat:nfs-kernel-server): Started 
apache-04
 nfs_ipv4   (ocf::heartbeat:IPaddr2):   Started apache-04
 Master/Slave Set: ma-ms-drbd0 [drbd0]
 Masters: [ apache-04 ]
 Slaves: [ apache-03 ]
 Resource Group: apache
 eccar_ipv4 (ocf::heartbeat:IPaddr2):   Started apache-04
 apache_loadbalancer(lsb:apache2):  Started apache-04
 Master/Slave Set: ma-ms-drbd1 [drbd1]
 Masters: [ apache-04 ]
 Slaves: [ apache-03 ]
 Resource Group: mail
 postfix_fs (ocf::heartbeat:Filesystem):Started apache-04
 postfix_ipv4   (ocf::heartbeat:IPaddr2):   Started apache-04
 spamass(lsb:spamass-milter):   Started apache-04
 clamav (lsb:clamav-daemon):Started apache-04
 postgrey   (lsb:postgrey): Started apache-04
 dovecot(lsb:dovecot):  Started apache-04
 postfix(ocf::heartbeat:postfix):   Started apache-04

This is my cluster: the mail group was gone, drbd1 was gone, apache was
gone and some resources of the routing group were missing. All of these
changes had been committed within the last 24 hours; after the suicide a
grep in /var/lib/heartbeat/crm showed that they were never saved.

Now I have rebooted both nodes and manually exported the configuration
to be on the very safe side.
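
By "manually exported" I mean something like the following (file names are
just what I picked):

crm configure show > /root/cluster-$(date +%F).crm
cibadmin -Q > /root/cib-$(date +%F).xml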

I'll collect the log files and provide them; crm_report doesn't work for
me, probably because my syslog location is non-default.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes

2013-08-05 Thread Thomas Glanzmann
Hello Andrew,

 did they ensure everything was flushed to disk first? 

(apache-03) [/var] cat /proc/sys/vm/dirty_expire_centisecs
3000

So dirty data should be flushed within 30 seconds (3000 centiseconds).
But I lost at least 24 hours, maybe even more. So it seems that
pacemaker / heartbeat does not persist my configuration changes, which
is strange, but I'll try to reproduce that in a lab, too.
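
To rule out caching next time, I can force a flush by hand before any reboot,
something like:

sync
grep Dirty /proc/meminfo    # should drop towards 0 kB once everything is on disk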

 thats not where recent versions of pacemaker keep the cib by default.
 check /var/lib/pacemaker/cib too

The directory does not exist. I'll provide you with the logs this
evening.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-08-05 Thread Thomas Glanzmann
Hello Andrew,

 You will need to run crm_report and email us the resulting tarball.
 This will include the version of the software you're running and log
 files (both system and cluster) - without which we can't do anything.

Find the files here:

I packaged it manually because the crm_report output was empty. If I
forgot something, please let me know. I included the daemon syslog
output for both nodes from the central syslog server, the crm file, the
ha.cf (which is the same on both nodes) and the /var/lib/heartbeat
directory from the first node, which seems to contain all the files.

The reason for the crash in unmanaged mode seems to be the same as
before:

Aug  4 18:50:27 apache-03 crmd: [29398]: ERROR: crm_abort: 
abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph 
!= NULL

I should probably update it.

But as for why the config got lost, I have no idea what went wrong here.

https://thomas.glanzmann.de/tmp/linux_ha_crash.2013-08-05.tar.gz

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-08-05 Thread Thomas Glanzmann
Hello Ulrich,

 Did it happen when you put the cluster into maintenance-mode, or did
 it happen after someone fiddled with the resources manually? Or did it
 happen when you turned maintenance-mode off again?

I could not remember, but I checked the log files, and yes, I did a config
change (I removed apache_loadbalancer from group apache). That is probably
the reason I could not reproduce it in my lab environment: I never tried to
fiddle with the config afterwards. The way to reproduce it is probably: put
the cluster into maintenance-mode, then change something in the config, and
it crashes. I still have to verify that in my lab; I'll do that right now
and report back.

...
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: +   configuration 
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + crm_config 
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: +   
cluster_property_set id=cib-bootstrap-options 
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + nvpair 
id=cib-bootstrap-options-maintenance-mode name=maintenance-mode 
value=true __crm_diff_marker__=added:top /
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: +   
/cluster_property_set
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /crm_config
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: +   /configuration
Aug  4 18:49:18 apache-03 cib: [29394]: info: cib:diff: + /cib
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - cib admin_epoch=0 
epoch=94 num_updates=100 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   configuration 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - resources 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   group 
id=apache 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - primitive 
class=ocf id=apache_loadbalancer provider=heartbeat type=apachetg 
__crm_diff_marker__=removed:top 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   operations 

Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - op 
id=apache_loadbalancer-monitor-60s interval=60s name=monitor /
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   
/operations
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /primitive
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   /group
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /resources
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: -   /configuration
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: - /cib
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + cib epoch=95 
num_updates=1 admin_epoch=0 validate-with=pacemaker-1.2 
crm_feature_set=3.0.6 update-origin=apache-03 update-client=cibadmin 
cib-last-written=Sun Aug  4
 18:49:18 2013 have-quorum=1 dc-uuid=61e8f424-b538-4352-b3fe-955ca853e5fb 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: +   configuration 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + resources 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: +   primitive 
class=ocf id=apache_loadbalancer provider=heartbeat type=apachetg 
__crm_diff_marker__=added:top 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + operations 
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: +   op 
id=apache_loadbalancer-monitor-60s interval=60s name=monitor /
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /operations
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: +   /primitive
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /resources
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: +   /configuration
Aug  4 18:50:20 apache-03 cib: [29394]: info: cib:diff: + /cib
...
Aug  4 18:50:27 apache-03 heartbeat: [29380]: ERROR: Managed 
/usr/lib/heartbeat/crmd process 29398 dumped core

The complete syslog is in the other e-mail I just sent to Alan, if you want
to check it.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-08-04 Thread Thomas Glanzmann
Hello Andrew,
I just got another crash when putting a node into unmanaged mode; this
time it hit me hard:

- Both nodes suicided or STONITHed each other.
- Only one out of four md devices was detected on both nodes after the
  reset.
- Half of the config was gone.

Could you help me get to the bottom of this?

This was on Debian Wheezy.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Wheezy / heartbeat / pacemaker: Howto make persistent configuration changes

2013-08-04 Thread Thomas Glanzmann
Hello,
both nodes of my HA cluster just panicked, and afterwards the config was
gone. Is there a command to force heartbeat / pacemaker to write the
config to disk, or do I need to restart heartbeat to make changes
persistent? The config had been on the node for at least 24 hours, but I
did not restart heartbeat in that time. Or should I always (which I now
start doing again) make manual backups such as:

sudo crm configure show > cluster.crm
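
If manual backups stay necessary, a cron entry like this would at least make
them automatic (a sketch; path and schedule are arbitrary):

# /etc/cron.d/cib-backup
0 3 * * * root cibadmin -Q > /var/backups/cib-$(date +\%F).xml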

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds

2013-06-16 Thread Thomas Glanzmann
Hello Andrew,

 If you include a crm_report for the scenario you're describing, I can
 take a look.  The config alone does not contain enough information.

I tried to reproduce that on Debian Wheezy (7.0) in my lab environment
and was unable to do so. I'll soon set up multiple other platforms and
will collect and post a crm_report if I trigger it again. This is the
second problem I was unable to reproduce in my lab environment. Very
frustrating.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] custom script status)

2013-06-07 Thread Thomas Glanzmann
Hello Mitsuo,
from the output you sent, you should update, because your heartbeat
version looks very, very ancient to me. A resource script for heartbeat
always needs at least these 5 operations:

#!/bin/bash

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

export PID=/var/run/postgrey.pid
export INITSCRIPT=/etc/init.d/postgrey

case "$1" in
        start)
                ${INITSCRIPT} start > /dev/null && exit || exit 1;
                ;;

        stop)
                ${INITSCRIPT} stop > /dev/null && exit || exit 1;
                ;;

        status)
                if [ -f ${PID} ]; then
                        kill -0 `cat ${PID}` > /dev/null && exit;
                fi

                exit 1;
                ;;

        monitor)
                if [ -f ${PID} ]; then
                        kill -0 `cat ${PID}` > /dev/null && {
                                exit 0;
                        }
                fi

                exit 7;
                ;;

        meta-data)
                cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="postgrey">
<version>1.0</version>

<longdesc lang="en">
OCF Resource Agent for postgrey.
</longdesc>

<shortdesc lang="en">OCF Resource Agent for postgrey.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
                ;;
esac

So start should return 0 when the resource was successfully started or is
already running, otherwise 1.

Stop should return 0 when the resource was successfully stopped or is
already stopped, otherwise 1.

Status should return 0 if the resource is running, otherwise 1.

Monitor should check if the resource is actually working properly and
return 0 on success and 7 on failure.

Meta-data just returns the supported actions, optional parameters, default
timeouts, intervals and monitoring delays.
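
A quick way to check those exit codes by hand before handing the script to
heartbeat (the install path below is only an example, use wherever you put
the agent):

cd /usr/lib/ocf/resource.d/heartbeat
./postgrey start;   echo "start:   $?"
./postgrey monitor; echo "monitor: $?"
./postgrey status;  echo "status:  $?"
./postgrey stop;    echo "stop:    $?"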

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-06-07 Thread Thomas Glanzmann
Hello Andrew,

  Jun  6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: 
  abort_transition_graph: Triggered assert at te_utils.c:339 : 
  transition_graph != NULL

 This is the cause of the coredump.
 What version of pacemaker is this?

1.1.7-1

 Installing pacemaker's debug symbols would also make the stack trace
 more useful.

I'll do that and get back to you.

I tried to reproduce the issue in my lab by installing two Debian Wheezy
VMs and reconstructing the network and HA config, but was unable to do
so. What puzzles me is that the issue showed up multiple times (at least
3 times) on the production system.

Rolf,
could you please run apt-get install pacemaker-dev and see if the
backtrace reveals a little bit more?

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Does drbd need re-start after configuration change ?

2013-06-07 Thread Thomas Glanzmann
Hello Fredrik,

* Fredrik Hudner fredrik.hud...@gmail.com [2013-06-07 14:03]:
 Been trying to figure out if drbd which is monitored by HA, needs a
 restart if you do a configuration change in global_common.conf?

http://www.drbd.org/users-guide/s-reconfigure.html

So you need to issue a 'drbdadm adjust <resource>'.
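
In other words, after editing global_common.conf something like this should
be enough, no restart needed (drbdadm dump is only there as a syntax check):

vi /etc/drbd.d/global_common.conf
drbdadm dump all      # sanity-check the changed configuration
drbdadm adjust all    # apply the new settings to the running resources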

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-06-07 Thread Thomas Glanzmann
Hello Andrew,

  Installing pacemaker's debug symbols would also make the stack trace
  more useful.

we tried to install heartbeat-dev to see more, but there are no
debugging symbols available. I also tried to reproduce the issue with a
64-bit Debian Wheezy (I had used 32-bit before), but was not able to
reproduce it. However, in the near future I'll set up 6 more Linux-HA
clusters using Debian Wheezy, and I'll report back if the issue happens
to me again. On the system where I can reproduce the problem, I will not
do any more experiments because it is about to go into production, and
except for the maintenance part everything works perfectly fine.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] custom script status)

2013-06-07 Thread Thomas Glanzmann
Hello Mitso,

 3.0.4-1.el6   

from the version I see that you're running RHEL 6. RHEL uses corosync or
cman, but not heartbeat, as the messaging bus between the nodes. You can
follow this guide and the links in it:

http://clusterlabs.org/quickstart-redhat.html

What is annoying from my point of view is that, if I understood Andrew's
blog correctly, Red Hat has removed the crm shell, so you have to use pcs.
Personally I prefer heartbeat and pacemaker, but with Red Hat that is a
challenge: you could use the EPEL repositories, but they're incompatible
with the pacemaker shipped by Red Hat, so you end up compiling it yourself.
Also, 2 years back I set up a cluster for Siemens and noticed the
limitations of corosync. At that time it could only handle two heartbeat
links; hopefully they have fixed that by now. I never tried cman with
anything other than ricci and luci (the old RHEL cluster stack).

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds

2013-06-06 Thread Thomas Glanzmann
Hello,
on Debian Wheezy (7.0) I installed pacemaker with heartbeat. When I put
multiple filesystems that depend on multiple DRBD promotions into one
group, only the first DRBD is promoted and the group never comes up.
However, when the promotions are ordered not against the individual
filesystems but against the group (or probably any single entity), all
DRBDs are promoted correctly. To summarize:

This only promotes the first drbd and the resource group never starts:

group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip
order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote drbd5_fs:start
order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote drbd8_fs:start
#   ~~

This works:
group astorage drbd5_fs drbd8_fs nfs-common nfs-kernel-server astorage_ip
order drbd5_fs_after_drbd5 inf: ma-ms-drbd5:promote astorage:start
order drbd8_fs_after_drbd8 inf: ma-ms-drbd8:promote astorage:start
#   ~~

I would like to know if that is supposed to happen, and if it is, why.
I assume it is a bug, but I'm not sure.
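
For completeness, a sketch of the colocation constraints I would pair with
those order statements (not shown in the config below; this is just what the
DRBD documentation usually recommends):

colocation drbd5_fs_with_drbd5 inf: astorage ma-ms-drbd5:Master
colocation drbd8_fs_with_drbd8 inf: astorage ma-ms-drbd8:Master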

Complete working config here:

primitive astorage_ip ocf:heartbeat:IPaddr2 \
params ip=10.10.50.32 cidr_netmask=24 nic=bond0.6 \
op monitor interval=60s
primitive astorage1-fencing stonith:external/ipmi \
params hostname=astorage1 ipaddr=10.10.30.21 userid=ADMIN 
passwd=secret \
op monitor interval=60s
primitive astorage2-fencing stonith:external/ipmi \
params hostname=astorage2 ipaddr=10.10.30.22 userid=ADMIN 
passwd=secret \
op monitor interval=60s
primitive astorage_16_ip ocf:heartbeat:IPaddr2 \
params ip=10.10.16.53 cidr_netmask=24 nic=eth0 \
op monitor interval=60s
primitive drbd10 ocf:linbit:drbd \
params drbd_resource=r10 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd10_fs ocf:heartbeat:Filesystem \
params device=/dev/drbd10 directory=/mnt/akvm/nfs fstype=ext4 \
op monitor interval=60s
primitive drbd3 ocf:linbit:drbd \
params drbd_resource=r3 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd4 ocf:linbit:drbd \
params drbd_resource=r4 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd5 ocf:linbit:drbd \
params drbd_resource=r5 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd5_fs ocf:heartbeat:Filesystem \
params device=/dev/drbd5 directory=/mnt/apbuild/astorage/packages 
fstype=ext3 \
op monitor interval=60s
primitive drbd6 ocf:linbit:drbd \
params drbd_resource=r6 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd8 ocf:linbit:drbd \
params drbd_resource=r8 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd8_fs ocf:heartbeat:Filesystem \
params device=/dev/drbd8 directory=/mnt/akvm/vms fstype=ext4 \
op monitor interval=60s
primitive drbd9 ocf:linbit:drbd \
params drbd_resource=r9 \
op monitor interval=29s role=Master \
op monitor interval=31s role=Slave
primitive drbd9_fs ocf:heartbeat:Filesystem \
params device=/dev/drbd9 directory=/exports fstype=ext4 \
op monitor interval=60s
primitive nfs-common ocf:heartbeat:nfs-common \
op monitor interval=60s
primitive nfs-kernel-server ocf:heartbeat:nfs-kernel-server \
op monitor interval=60s
primitive target ocf:heartbeat:target \
op monitor interval=60s
group astorage drbd5_fs drbd8_fs drbd9_fs drbd10_fs nfs-common 
nfs-kernel-server astorage_ip astorage_16_ip target \
meta target-role=Started
ms ma-ms-drbd10 drbd10 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd3 drbd3 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd4 drbd4 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd5 drbd5 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd6 drbd6 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd8 drbd8 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
ms ma-ms-drbd9 drbd9 \
meta master-max=1 master-node-max=1 clone-max=2 
clone-node-max=1 notify=true target-role=Started
location astorage1-fencing-placement astorage1-fencing -inf: astorage1
location astorage2-fencing-placement astorage2-fencing -inf: astorage2
location cli-standby-astorage 

[Linux-HA] pacemaker with heartbeat on Debian Wheezy reboots the node reproducable when putting into maintance mode because of a /usr/lib/heartbeat/crmd crash

2013-06-06 Thread Thomas Glanzmann
Hello,
over the last couple of days, I set up an active/passive NFS server and
iSCSI storage using drbd, pacemaker, heartbeat, LIO and the NFS kernel
server. While testing the cluster I often put it into maintenance
(unmanaged) mode using:

crm configure property maintenance-mode=true

Sometimes when I did that, both nodes, or only the standby node, suicided
because /usr/lib/heartbeat/crmd crashed. I can reproduce the problem
easily. It even happened to me with a two-node cluster that had no
resources at all. If you need more information, drop me an e-mail.

Highlights of the log:

Jun  6 10:17:37 astorage1 crmd: [2947]: notice: do_state_transition: State 
transition S_POLICY_ENGINE - S_INTEGRATION [ input=I_FAIL cause=C_FSA_INTERNAL 
origin=get_lrm_resource ]
Jun  6 10:17:37 astorage1 crmd: [2947]: ERROR: crm_abort: 
abort_transition_graph: Triggered assert at te_utils.c:339 : transition_graph 
!= NULL
Jun  6 10:17:37 astorage1 heartbeat: [2863]: WARN: Managed 
/usr/lib/heartbeat/crmd process 2947 killed by signal 11 [SIGSEGV - 
Segmentation violation].
Jun  6 10:17:37 astorage1 ccm: [2942]: info: client (pid=2947) removed from ccm
Jun  6 10:17:37 astorage1 heartbeat: [2863]: ERROR: Managed 
/usr/lib/heartbeat/crmd process 2947 dumped core
Jun  6 10:17:37 astorage1 heartbeat: [2863]: EMERG: Rebooting system.  Reason: 
/usr/lib/heartbeat/crmd

See the log:

Jun  6 10:17:22 astorage1 crmd: [2947]: info: do_election_count_vote: Election 
4 (owner: 56adf229-a1a7-4484-8f18-742ddce19db8) lost: vote from astorage2 
(Uptime)
Jun  6 10:17:22 astorage1 crmd: [2947]: notice: do_state_transition: State 
transition S_NOT_DC - S_PENDING [ input=I_PENDING cause=C_FSA_INTERNAL 
origin=do_election_count_vote ]
Jun  6 10:17:27 astorage1 crmd: [2947]: info: update_dc: Set DC to astorage2 
(3.0.6)
Jun  6 10:17:28 astorage1 cib: [2943]: info: cib_process_request: Operation 
complete: op cib_sync for section 'all' (origin=astorage2/crmd/210, 
version=0.9.18): ok (rc=0)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_local_callback: Sending 
full refresh (origin=crmd)
Jun  6 10:17:28 astorage1 crmd: [2947]: notice: do_state_transition: State 
transition S_PENDING - S_NOT_DC [ input=I_NOT_DC cause=C_HA_MESSAGE 
origin=do_cl_join_finalize_respond ]
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd3:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd10:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd8:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd6:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd5:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd9:0 (1)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: probe_complete (true)
Jun  6 10:17:28 astorage1 attrd: [2946]: notice: attrd_trigger_update: Sending 
flush op to all hosts for: master-drbd4:0 (1)
Jun  6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[35] 
on astorage2-fencing for client 2947, its parameters: hostname=[astorage2] 
userid=[ADMIN] CRM_meta_timeout=[2] CRM_meta_name=[monitor] passwd=[ADMIN] 
crm_feature_set=[3.0.6] ipaddr=[10.10.30.22] CRM_meta_interval=[6]  
cancelled
Jun  6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation 
astorage2-fencing_monitor_6 (call=35, status=1, cib-update=0, 
confirmed=true) Cancelled
Jun  6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[36] 
on drbd10:0 for client 2947, its parameters: drbd_resource=[r10] 
CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] 
CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd10:0 
] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[2] 
CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] 
CRM_meta_notify_start_resource=[drbd10:0 ] 
CRM_meta_notify_start_uname=[astorage1 ] crm_feature_set=[3.0.6] 
CRM_meta_notify=[true] CRM_meta_notify_promote_resour cancelled
Jun  6 10:17:30 astorage1 crmd: [2947]: info: process_lrm_event: LRM operation 
drbd10:0_monitor_31000 (call=36, status=1, cib-update=0, confirmed=true) 
Cancelled
Jun  6 10:17:30 astorage1 lrmd: [2944]: info: cancel_op: operation monitor[37] 
on drbd3:0 for client 2947, its parameters: drbd_resource=[r3] 
CRM_meta_role=[Slave] CRM_meta_notify_stop_resource=[ ] 
CRM_meta_notify_demote_resource=[ ] CRM_meta_notify_inactive_resource=[drbd3:0 
] CRM_meta_notify_promote_uname=[ ] CRM_meta_timeout=[2] 
CRM_meta_notify_master_uname=[astorage2 ] CRM_meta_name=[monitor] 
CRM_meta_notify_start_resource=[drbd3:0 ] 

Re: [Linux-HA] Pacemaker: Only the first DRBD is promoted in a group having multiple filesystems which promote individual drbds

2013-06-06 Thread Thomas Glanzmann
Hello Emmanuel,

* emmanuel segura emi2f...@gmail.com [2013-06-06 11:12]:
 order drbd_fs_after_drbd inf: ma-ms-drbd5:promote ma-ms-drbd8:promote 
 astorage:start

I can see that you promoted multiple DRBDs in one line. My config where
I promote them individually also works. However, my question was why it
is not possible to promote on a per-filesystem basis when there are
multiple DRBD promotions in one group.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] How to fix ERROR: Cannot chdir to [/var/lib/heartbeat/cores/hacluster]: Permission denied?

2013-06-06 Thread Thomas Glanzmann
Hello Shuwen,

 What functionality of dir /var/lib/heartbeat/cores/hacluster?

if a component of heartbeat crashes, the core files are kept in this
directory so that a post-mortem analysis of the problem can be done.

 How to fix this error print? What is your advice?

Fix the permissions. In general that is:

chown <heartbeat user> /var/lib/heartbeat/cores/hacluster

For me on Debian that is:

chown hacluster /var/lib/heartbeat/cores/hacluster
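
You can verify the result with the following; the owner column should show
whatever user heartbeat runs as:

ls -ld /var/lib/heartbeat/cores/hacluster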

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failed actions

2013-04-08 Thread Thomas Glanzmann
Hello Andrew,

 In this case, it is the initial monitor (the one that tells pacemaker
 what state the service is in before we try to start anything) that is
 failing.  For the ones returning rc=1, it looks like something was
 wrong but the cluster was able to clean them up (by running stop) and
 start them again.

I see, thanks.

  crm resource cleanup all

 That should work.

I'll try that and report back.

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Thomas Glanzmann
Hello,

 ipv6addr=2600:3c00::0034:c007

from the manpage of ocf_heartbeat_IPv6addr it looks like you have to
specify the netmask, so try:

ipv6addr=2600:3c00::0034:c007/64 (assuming that you're in a /64).

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Thomas Glanzmann
Hello Nick,

 Thanks for the tip, however, it did not work.  That's actually a /116.
 So I put in 2600:3c00::0034:c007/116 and am getting the same
 error.  I requested that it restart the resource as well, just to make
 sure it wasn't the previous error.

now, I had to try it:

node $id=9d9b62d2-405d-459a-a724-cb2643d7d9a1 node-62
primitive ipv6test ocf:heartbeat:IPv6addr \
params ipv6addr=2a01:4f8:bb:400::2/64 \
op monitor interval=15 timeout=15 \
meta target-role=Started
property $id=cib-bootstrap-options \
dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
cluster-infrastructure=Heartbeat \
stonith-enabled=false

And it works:

(node-62) [~] ifconfig
eth0  Link encap:Ethernet  HWaddr 00:25:90:97:db:b0
  inet addr:10.100.4.62  Bcast:10.100.255.255  Mask:255.255.0.0
  inet6 addr: 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 Scope:Global
  inet6 addr: fe80::225:90ff:fe97:dbb0/64 Scope:Link
  inet6 addr: 2a01:4f8:bb:400::2/64 Scope:Global
  UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
  RX packets:40345 errors:0 dropped:0 overruns:0 frame:0
  TX packets:10270 errors:0 dropped:0 overruns:0 carrier:0
  collisions:0 txqueuelen:1000
  RX bytes:52540127 (50.1 MiB)  TX bytes:1127817 (1.0 MiB)
  Memory:fb58-fb60

(infra) [~] traceroute 2a01:4f8:bb:400::2
traceroute to 2a01:4f8:bb:400::2 (2a01:4f8:bb:400::2), 30 hops max, 80 byte 
packets
 1  merlin.glanzmann.de (2a01:4f8:bb:4ff::1)  1.413 ms  1.550 ms  1.791 ms
 2  2a01:4f8:bb:400::2 (2a01:4f8:bb:400::2)  0.204 ms  0.202 ms  0.270 ms

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Thomas Glanzmann
Hello Nick,

 Anything I need to do to allow IPv6... or something?

I agree with Greg here. Have you tried setting the address manually?

ip -6 addr add <ip>/<cidr> dev eth0
ip -6 addr show dev eth0
ip -6 addr del <ip>/<cidr> dev eth0
ip -6 addr show dev eth0

(node-62) [~] ip -6 addr add 2a01:4f8:bb:400::3/64 dev eth0
(node-62) [~] ip -6 addr show dev eth0
2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000
inet6 2a01:4f8:bb:400::3/64 scope global
   valid_lft forever preferred_lft forever
inet6 2a01:4f8:bb:400::2/64 scope global
   valid_lft forever preferred_lft forever
inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic
   valid_lft 2591998sec preferred_lft 604798sec
inet6 fe80::225:90ff:fe97:dbb0/64 scope link
   valid_lft forever preferred_lft forever
(node-62) [~] ip -6 addr del 2a01:4f8:bb:400::3/64 dev eth0
(node-62) [~] ip -6 addr show dev eth0
2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000
inet6 2a01:4f8:bb:400::2/64 scope global
   valid_lft forever preferred_lft forever
inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic
   valid_lft 2591990sec preferred_lft 604790sec
inet6 fe80::225:90ff:fe97:dbb0/64 scope link
   valid_lft forever preferred_lft forever

Do you see a link-local address on your eth0? A link-local address is one
that starts with fe80::. Otherwise try loading the ipv6 module:

modprobe ipv6 # Don't know if that is the right module name; all my
              # kernels have ipv6 built in (Debian wheezy / squeeze / backports)

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat IPv6addr OCF

2013-03-24 Thread Thomas Glanzmann
Hello Nick,

 I shouldn't be able to do that if the IPv6 module wasn't loaded,
 correct?

that is correct. I tried modifying my netmask to match yours, and I get
the same error you do:

ipv6test_start_0 (node=node-62, call=6, rc=1, status=complete): unknown 
error

So probably a bug in the resource agent. Manually adding and removing
works:

(node-62) [~] ip -6 addr add 2a01:4f8:bb:400::2/116 dev eth0
(node-62) [~] ip -6 addr show dev eth0
2: eth0: BROADCAST,MULTICAST,UP,LOWER_UP mtu 1500 qlen 1000
inet6 2a01:4f8:bb:400::2/116 scope global
   valid_lft forever preferred_lft forever
inet6 2a01:4f8:bb:400:225:90ff:fe97:dbb0/64 scope global dynamic
   valid_lft 2591887sec preferred_lft 604687sec
inet6 fe80::225:90ff:fe97:dbb0/64 scope link
   valid_lft forever preferred_lft forever
(node-62) [~] ip -6 addr del 2a01:4f8:bb:400::2/116 dev eth0

Nick, you can do the following things to resolve this:

- Hunt down the bug and fix it or let someone else do it for you

- Use another netmask, if possible (fighting the symptoms instead of
  resolving the root cause)

- Write your own resource agent (fighting the symptoms instead of
  resolving the root cause)

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Failed actions

2013-03-22 Thread Thomas Glanzmann
Hello,
I have an openais installation on CentOS which has logged failed actions,
but the services appear to be 'started'. As far as I know
heartbeat/pacemaker, if an action fails the service should not be started.
I also have a system on Debian Squeeze that stops the service when a
monitor action for IPMI has failed. But I also remember that a few years
back the ability was added to retry failed actions after a configurable
time, although I never used it. From the output of crm_mon -i 1 -r
I assume that the fence_ipmilan agents are running but that there are
some failed actions. Can I clean them up the old way using

crm_resource -C -r fence-astore1 -H astorage2
crm_resource -C -r fence-astore2 -H astorage1

or

crm resource cleanup all

?

The output of crm_mon is here:

http://pbot.rmdir.de/Qux4BaurFOUOYLfzqJNcfQ

The crm config is here:

http://thomas.glanzmann.de/tmp/crm_config.txt

DRBD config is here:

http://thomas.glanzmann.de/.www/tmp/drbd.txt

Also, I would like some feedback on the config; I think the following
configuration errors were made (a sketch for the first point follows the
list):

- Stonith and quorum are disabled

- Promote and colocation constraints for drbd resources and
  fs-storage are missing

- Peer outdater for drbd is missing and suicide is the wrong
  approach for the task at hand.
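
As a sketch for the first point, with the crm shell (adjust to taste for a
two-node cluster):

crm configure property stonith-enabled=true
crm configure property no-quorum-policy=ignore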

Cheers,
Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] stonith failed to start

2009-08-20 Thread Thomas Glanzmann
Hello Terry,

 What would cause the stonith 'start' operation to fail after it
 initially had succeeded?

if my understanding is correct (I wrote a stonith agent for vSphere
yesterday), then it runs the status command of the stonith agent and
looks at the exit status, like this:

(ha-01) [~] VI_SERVER=esx-03.glanzmann.de VI_USERNAME=root 
/usr/lib/stonith/plugins/external/vsphere status; echo $?
Enter password:
0

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] vsphere stonith, squid3 agent for debian lenny and example configuration

2009-08-20 Thread Thomas Glanzmann
Hello,
find attached a vSphere stonith plugin (works with ESX Server 3/4 and
Virtual Center 2.X and 4), a squid3 resource agent for Debian Lenny and
an example configuration.

Thomas
use_logd yes
bcast eth0
node ha-01 ha-02
watchdog /dev/watchdog
crm on
#!/bin/sh

if [ -z "${OCF_ROOT}" ]; then
        export OCF_ROOT=/usr/lib/ocf/
fi

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

SQUID_PORT=3128
INIT_SCRIPT=/etc/init.d/squid3
PID=/var/run/squid3.pid
CHECK_URLS="http://www.google.de/ http://www.glanzmann.de/ http://www.uni-erlangen.de"

case "$1" in
        start)
                if [ -f ${PID} ]; then
                        kill -0 `cat ${PID}` > /dev/null && exit 0;
                else
                        rm -f ${PID}
                fi

                ${INIT_SCRIPT} start > /dev/null 2>&1 && exit || exit 1
                ;;

        stop)
                if [ -f ${PID} ]; then
                        ${INIT_SCRIPT} stop > /dev/null 2>&1 && exit || exit 1
                fi

                exit 0;
                ;;

        status)
                if [ -f ${PID} ]; then
                        kill -0 `cat ${PID}` && exit;
                fi

                exit 1;
                ;;

        monitor)
                if [ -f ${PID} ]; then
                        kill -0 `cat ${PID}` || exit 7
                else
                        exit 7;
                fi

                for URL in ${CHECK_URLS}; do
                        http_proxy=http://localhost:${SQUID_PORT}/ wget -o /dev/null -O /dev/null -T 1 -t 1 ${URL} && exit
                done

                exit 1;
                ;;

        meta-data)
                cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="squid">
<version>1.0</version>

<longdesc lang="en">
OCF Resource Agent on top of the squid init script shipped with Debian.
</longdesc>

<shortdesc lang="en">OCF Resource Agent on top of the squid init script
shipped with Debian.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
                ;;
esac
#!/usr/bin/perl

use strict;
use warnings FATAL => 'all';

# Thomas Glanzmann 10:28 09-08-19
# apt-get install libarchive-zip-perl libclass-methodmaker-perl
# libcompress-raw-zlib-perl libcompress-zlib-perl libdata-dump-perl
# libio-compress-base-perl libio-compress-zlib-perl libsoap-lite-perl
# liburi-perl libuuid-perl libxml-libxml-perl libxml-libxml-common-perl
# libxml-namespacesupport-perl libwww-perl
# tar xfz ~/VMware-vSphere-SDK-for-Perl-4.0.0-161974.i386.tar.gz
# answer all questions with no

use lib '/usr/lib/vmware-vcli/apps/';
use VMware::VIRuntime;
use AppUtil::VMUtil;

sub
connect
{
        Opts::parse();
        Opts::validate();
        Util::connect();
}

my $vm_views = undef;

sub poweron_vm {
        foreach (@$vm_views) {
                my $mor_host = $_->runtime->host;
                my $hostname = Vim::get_view(mo_ref => $mor_host)->name;
                eval {
                        $_->PowerOnVM();
                        Util::trace(0, "\nvirtual machine '" . $_->name .
                                       "' under host $hostname powered on\n");
                };
                if ($@) {
                        if (ref($@) eq 'SoapFault') {
                                Util::trace(0, "\nError in '" . $_->name .
                                               "' under host $hostname: ");
                                if (ref($@->detail) eq 'NotSupported') {
                                        Util::trace(0, "Virtual machine is marked as a template");
                                } elsif (ref($@->detail) eq 'InvalidPowerState') {
                                        Util::trace(0, "The attempted operation cannot be performed in the current state");
                                } elsif (ref($@->detail) eq 'InvalidState') {
                                        Util::trace(0, "Current state of the virtual machine is not supported for this operation");
                                } else {
                                        Util::trace(0, "VM '" . $_->name . "' can't be powered on\n" . $@);
                                }
                        } else {
                                Util::trace(0, "VM '" . $_->name . "' can't be powered on\n" . $@);
                        }
                        Util::disconnect();
                        exit 1;
                }
        }
}

sub poweroff_vm {
        foreach (@$vm_views) {
                my $mor_host = $_->runtime->host;
                my $hostname = Vim::get_view(mo_ref => $mor_host

Re: [Linux-HA] Automatic Clenaup of certain resources

2008-09-02 Thread Thomas Glanzmann
Hello Andrew,

* Andrew Beekhof [EMAIL PROTECTED] [080117 09:13]:

 On Jan 17, 2008, at 7:34 AM, Thomas Glanzmann wrote:
 I use Linux HA to monitor some services on a dial in machine. A so
 called single node lcuster. For example sometimes my dial-in connection
 or openvpn connection, or IPv6 connectivity does not come. Is there a
 way to tell Linux-HA to retry a failed resource after a certain amount
 of time again?

 not yet but soon

 only in the last few days has the lrmd started exposing the timing
 data required in order to do this

is this possible today? Can someone give me a short walk-through?
Yesterday I had a problem where one of my tomcats didn't come up because
of heavy load and I had to clean up the resource manually. A retry every
10 minutes, or even every minute, would have been sufficient.

What do I have to do to get recent linux-ha packages for Debian Etch?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Automatic Clenaup of certain resources

2008-09-02 Thread Thomas Glanzmann
Hello Andrew,

  is this possible today?

 yes but only with pacemaker 0.7

thanks a lot, I found the configuration option failure-timeout=60s.

  Can someone give me a short walk-through?

 Look for Migrating Due to Failure in
http://clusterlabs.org/mw/Image:Configuration_Explained_1.0.pdf

 http://download.opensuse.org/repositories/server:/ha-clustering:/UNSTABLE/Debian_Etch/

 (same heartbeat package as
 http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch/
 but also has pacemaker 0.7)

thanks a lot! I'll try them out and come back with the result.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Announcement: heartbeat/pacemaker documentation in hg

2008-04-28 Thread Thomas Glanzmann
Hello Dejan,

 http://hg.clusterlabs.org/pacemaker/doc/archive/tip.tar.gz

I am unable to build this:

(ad027088pc) [/var/tmp/Pacemaker-Docs-80da5f68a837] make
/usr/lib/ocf/resource.d/heartbeat/AudibleAlarm: line 19: 
/resource.d/heartbeat/.ocf-shellfuncs: No such file or directory
-:1: parser error : Document is empty

^
-:1: parser error : Start tag expected, '' not found

^
I/O error : Invalid seek
unable to parse -
xml/ra-AudibleAlarm.xml:1: parser error : Document is empty

^
xml/ra-AudibleAlarm.xml:1: parser error : Start tag expected, '' not found
...

Are there any precompiled PDFs around?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] heartbeat failover not working on hard drive error

2008-03-28 Thread Thomas Glanzmann
Hello Coach-X (what a strange name),

 This has happened several times.  Nothing shows up in either log file,
 and a hard reboot brings the master back online.  Is this caused by
 the serial link still being active?  Is there a way to have this type
 of issue cause the slave to become active?

exactly. I personally use 3ware RAID controllers with a RAID-1 (mirror)
configured. I monitor these controllers with Nagios and swap disks
within 2 days if one dies. But you could also use Linux software RAID
and _sata_ (not _pata_) disks to achieve the above. Another way to
detect disk failures would be a resource agent that does something like
invalidating the buffer cache and then running a find or ls on the
filesystem, and putting that resource agent into the group that contains
exim.

The monitor action would be something like this:

if [ -f /var/run/ressource-agent ]; then
        sync; echo 3 > /proc/sys/vm/drop_caches
        ls / > /dev/null && exit 0 || exit 1
else
        exit 7;
fi

See also http://linux-mm.org/Drop_Caches

I assume you use Linux, but if you don't, find a reasonably well-supported
RAID controller for your hardware architecture / OS.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HA maintenance mode

2008-03-28 Thread Thomas Glanzmann
Hello Danny,

 Would be really nice to have that as cluster command in HA or as
 hb_gui feature already available. Or just a switch to enable/disable
 failover for mainteance purpose.

it is already there; it is the default policy. I just can't be bothered to
look it up in the manual right now, but maybe you are lucky enough that
someone else will chime in with the name; otherwise you will have to look
it up yourself.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] VLAN Trunk, IPaddr2, and static routes...

2008-03-27 Thread Thomas Glanzmann
Hello Chris,
there is no need to put the VLAN logic into the resource agent. Just
configure the interface _beforehand_ and use it _afterwards_. I have had
this running for ages on two different machines and it just works.
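
For example, you can create the VLAN interface outside the cluster and only
reference it from IPaddr2 via its nic parameter (interface and VLAN id below
are just illustrative):

vconfig add bond0 6        # creates bond0.6
ifconfig bond0.6 up
# crm: primitive some_ip ocf:heartbeat:IPaddr2 params ip=... cidr_netmask=24 nic=bond0.6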

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] external/ipmi example configuration

2008-03-27 Thread Thomas Glanzmann
Hello Martin,
it is pure luck that I am so bored that I read this list, next time CC
me. :-)

 I have read several postings in the mail archive about the
 external/ipmi configuration but there are still some questions that
 bother me.  The last posting from Thomas: did this cib-configuration
 worked with your 2-node cluster?  I have to configure also 2 nodes and
 would like to use the ipmi-plugin but I am unsure if I understand what
 the plugin really does.

I have the following configuration on two systems and I have verified that
it works as it should. Someone on this list told me that I can drop the
location constraints; however, I decided to keep them until I have
verified that myself.

<configuration>
    <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
            <attributes>
                <nvpair name="stonith-enabled" value="true" id="stonith-enabled"/>
                <nvpair name="stonith-action" value="reboot" id="stonith-action"/>
            </attributes>
        </cluster_property_set>
    </crm_config>

    <resources>
        <primitive id="apache-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
            <operations>
                <op id="apache-01-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
                <op id="apache-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
            </operations>

            <instance_attributes id="ia-apache-01-fencing">
                <attributes>
                    <nvpair id="apache-01-fencing-hostname" name="hostname" value="apache-01"/>
                    <nvpair id="apache-01-fencing-ipaddr" name="ipaddr" value="172.18.0.101"/>
                    <nvpair id="apache-01-fencing-userid" name="userid" value="Administrator"/>
                    <nvpair id="apache-01-fencing-passwd" name="passwd" value="whatever"/>
                </attributes>
            </instance_attributes>
        </primitive>

        <primitive id="apache-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
            <operations>
                <op id="apache-02-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
                <op id="apache-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
            </operations>

            <instance_attributes id="ia-apache-02-fencing">
                <attributes>
                    <nvpair id="apache-02-fencing-hostname" name="hostname" value="apache-02"/>
                    <nvpair id="apache-02-fencing-ipaddr" name="ipaddr" value="172.18.0.102"/>
                    <nvpair id="apache-02-fencing-userid" name="userid" value="Administrator"/>
                    <nvpair id="apache-02-fencing-passwd" name="passwd" value="whatever"/>
                </attributes>
            </instance_attributes>
        </primitive>
    </resources>

    <constraints>
        <rsc_location id="apache-01-fencing-placement" rsc="apache-01-fencing">
            <rule id="apache-01-fencing-placement-rule-1" score="-INFINITY">
                <expression id="apache-01-fencing-placement-exp-02" value="apache-02" attribute="#uname" operation="ne"/>
            </rule>
        </rsc_location>

        <rsc_location id="apache-02-fencing-placement" rsc="apache-02-fencing">
            <rule id="apache-02-fencing-placement-rule-1" score="-INFINITY">
                <expression id="apache-02-fencing-placement-exp-02" value="apache-01" attribute="#uname" operation="ne"/>
            </rule>
        </rsc_location>
    </constraints>
</configuration>
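
If you want to load a snippet like this into a running cluster instead of
editing cib.xml on disk, something along these lines should do it (a
sketch, untested as written here; the file names are just examples):

# load the fencing primitives and the placement constraints into the live CIB
cibadmin -C -o resources -x fencing-resources.xml
cibadmin -C -o constraints -x fencing-constraints.xml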

I killed heartbeat with -9 to simulate a node failure.


 To configure the plugin, I will create a resource for every node. This
 means, two additional resources in my cib.xml because I have two
 cluster-nodes.

Correct.

 The attributes (nvpair) define variables for the ipmi-script, e.g.
 hostname...  But what does the constraints tell me? If #uname is not
 equal tovalue then the score ist -INFINITY, i.e. the resource will
 never be started on that node?

you pin ,,apache-01-fencing'' to apache-02 and ,,apache-02-fencing'' to
apache-01, so that the resource that can STONITH apache-01 runs on apache-02
and vice versa. Someone stated that heartbeat is able to commit suicide
(STONITH itself), but that isn't true, at least not via stonith and not in
version 2.1.3. The location constraints seem to be unnecessary, because
if the fencing resource is running on the wrong node and that node
misbehaves, it is restarted on the remaining node and then shoots the
misbehaving one. However 

Re: [Linux-HA] Compiling Heartbeat on Solaris10

2008-02-14 Thread Thomas Glanzmann
Hello Ken,

 I am having trouble compiling Heartbeat 2.0.7 on a Solaris 10 system.
 I have tried SunStudio11 and gcc 3.3 and 4.0. Is there any information
 I can read that might help? It's complaining about Gmain_timeout_funcs
 in lib/clplumbing/GSource.c, if anyone has seen that before.

first of all, use version 2.1.3. I am going to compile heartbeat for
Solaris myself, but it might take some time. Once I am done, I will
publish my Solaris packages.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Removing a node from cluster

2008-02-14 Thread Thomas Glanzmann
Hello Franck,

 Suppose I have a 3 nodes cluster: node1, node2, node3.  I want to
 remove node2 from the cluster to be able to perform various operation
 on the node2 without any risk of ressources moving to node2.  I tried
 to figure out with the cibadmin or crm_ressource but I don't get it.

# Put node into standby mode
crm_standby -U node2 -v on

# Make the node active again (deleting the attribute has the same
# effect as setting it to off)
crm_standby -D -U node2

Or you log in to that node and type:

/etc/init.d/heartbeat stop
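
You can also verify that the resources have really moved away before you
start working on node2, for example:

# put node2 into standby and watch the resources move
crm_standby -U node2 -v on
crm_mon -1 -r          # node2 should show no running resources

# when you are done, make it active again
crm_standby -D -U node2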

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resource script question (runlevel config)

2008-02-14 Thread Thomas Glanzmann
Hello Amy,

 What about something like monit to make sure ssh is up and running and
 restart if it crashes?

thanks for the pointer, a very interesting tool. I was looking for
something like that but decided to write something myself; it sounds
great, though, so maybe I will give it a try.

http://www.tildeslash.com/monit/

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] ClusterIP

2008-02-07 Thread Thomas Glanzmann
Hello,
I would like to do a ClusterIP setup with SLES 10. A few things are
unclear to me. With ClusterIP you have one IP address that is shared by
two or more nodes; it usually uses a multicast MAC address and both nodes
see all traffic. But when one node goes down, how does the other node know
that it now has to handle all the traffic and not only a part of it?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ClusterIP

2008-02-07 Thread Thomas Glanzmann
Hello,
thank you a lot for the feedback! Now I understand how the failover
works. Does someone have a ready-to-use cib.xml that I can use for
testing? I am going to try my luck right now and come back in an hour or
so with my findings; it would be nice if someone could comment on them.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] propagate value similliar to pingd

2008-02-07 Thread Thomas Glanzmann
Hello,
I would like to write a script similar to pingd that is spawned by
heartbeat and populates a value in the CIB that I can build a rule on.
What do I have to do to achieve this? Concrete questions are:

- What do I have to put in the cib to spawn such an 'agent'?

- How do I propagate the value into the cib?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ClusterIP

2008-02-07 Thread Thomas Glanzmann
Hello again,
here comes my cib.xml for a ClusterIP. But the resource stickiness is not
working for me: when I shut down ha-2, the two clone instances stay on ha-1.
Any ideas? Before sending this e-mail I used the following command to
set some location constraints:

crm_resource -M -r ip0:0 -H ha-1

of course you can't do that when ip0:0 is already on ha-1, because
crm_resource gets smart with me:

(ha-1) [~] crm_resource -M -r ip0:0 -H ha-1
Error performing operation: ip0:0 is already active on ha-1

However, it seems that location constraints or preferences work fine with
cloned resources, so I don't seem to need resource_stickiness (it doesn't
work for me anyway), and that answers my other question, I guess. The
resources also come back on the right host after I simulate a power
failure. Nice feature, I like it very much.

(ha-1) [~] crm_mon -1 -r


Last updated: Thu Feb  7 22:25:12 2008
Current DC: ha-1 (330da1b6-5f99-480a-b071-a144a98e1248)
2 Nodes configured.
1 Resources configured.


Node: ha-2 (095256ab-361c-4b1e-9a8b-8bed74c4a7fb): online
Node: ha-1 (330da1b6-5f99-480a-b071-a144a98e1248): online

Full list of resources:

Clone Set: clusterip-clone
ip0:0   (heartbeat::ocf:IPaddr2):   Started ha-1
ip0:1   (heartbeat::ocf:IPaddr2):   Started ha-1

<configuration>
    <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
            <attributes>
                <nvpair name="ressource_stickiness" value="0" id="ressource-stickiness"/>
            </attributes>
        </cluster_property_set>
    </crm_config>

    <resources>
        <clone id="clusterip-clone">
            <meta_attributes id="clusterip-clone-ma">
                <attributes>
                    <nvpair id="clusterip-clone-1" name="globally_unique" value="false"/>
                    <nvpair id="clusterip-clone-2" name="clone_max" value="2"/>
                    <nvpair id="clusterip-clone-3" name="clone_node_max" value="2"/>
                </attributes>
            </meta_attributes>

            <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="ip0">
                <instance_attributes id="ia-ip0">
                    <attributes>
                        <nvpair id="ia-ip0-1" name="ip" value="157.163.248.193"/>
                        <nvpair id="ia-ip0-2" name="cidr_netmask" value="25"/>
                        <nvpair id="ia-ip0-3" name="nic" value="eth0"/>
                        <nvpair id="ia-ip0-4" name="mac" value="01:02:03:04:05:06"/>
                        <nvpair id="ia-ip0-5" name="clusterip_hash" value="sourceip-sourceport"/>
                    </attributes>
                </instance_attributes>
                <operations>
                    <op id="ip0-monitor0" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>
                </operations>
            </primitive>
        </clone>
    </resources>
</configuration>

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ClusterIP

2008-02-07 Thread Thomas Glanzmann
Hello Lars,

 Uhm, what do you think should happen when you shutdown ha-2 - of
 course they stat on ha-1 in that case?

I meant that I shut it down temporarily, and when it comes back up both
clones stay on one node instead of one of them moving back.

 I don't know what you're saying here ;-)

I said that ressource_stickiness=0 does not work for me, so I used a
location constraint to put ip0:0 on ha-1 and ip0:1 on ha-2 to get that
behaviour. But as I write this e-mail I realize that I misspelled
resource_stickiness.

  <nvpair name="ressource_stickiness" value="0" id="ressource-stickiness"/>

 With resource stickiness, this should be spread across two nodes?

Sure thing, if I manage to write it correctly. :-)

 This setting is wrong. globally_unique must be true for the cluster
 ip.  Your configuration doesn't really work ;-)

Okay, then this is the right moment to ask what globally_unique is about
anyway. I never got it; I just copied and pasted it.

  <nvpair id="clusterip-clone-2" name="clone_max" value="2"/>

 You can drop this line, it defaults to the number of nodes anyway -
 unless, of course, you want to make it larger so you can do more fine
 grained load control later.

Thanks. I will do that.

  <nvpair id="ia-ip0-4" name="mac" value="01:02:03:04:05:06"/>

 That's not a valid multicast MAC.

I see. I thought every MAC address with the first bit set to one is a
multicast MAC address. However, I also tried an autogenerated one and got
it working, but only on the same network. It seems that I have to set a
static entry on the default router to really get it working from outside
that network.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Samba and High Availability

2008-02-07 Thread Thomas Glanzmann
Hello Christopher,

 Everything I have read about samba and HA made it seem like this was
 not possible. Are others doing this too? Can you think of some good
 tests to try to stress it (short of accessing a database or
 something). I imagine a fail-over during a large copy operation would
 fail, and I'll test that tomorrow. But for the moment, I'm just so
 psyched I had to tell somebody, and the dog couldn't care less. ;)

well I am not a dog, but I do in fact care. So could you please
elaborate a bit and post your cib.xml configuration and your ra for
samba?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD 8.0 under Debian Etch?

2008-02-06 Thread Thomas Glanzmann
Hello Fabiano,

 Short question: Does anyone here have DRBD8 running with heartbeat
 under Etch?

I do, and it works like a charm. Search the archives for the complete
config, or drop me an e-mail and I will resend it to you along with a few
things you should follow to get a perfect drbd setup.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] About configuring DRBD v8 on HA v2

2008-01-30 Thread Thomas Glanzmann
Hello Stefano,
it is not possible to configure drbd as a master/slave resource through
the GUI. For a walkthrough use the following:

http://article.gmane.org/gmane.linux.highavailability.user/22132

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and Pingd

2008-01-25 Thread Thomas Glanzmann
Hello Dominik,

 You can also start pingd from ha.cf with a respawn directive. Just as Steve 
 did it. Works fine here and imho has the advantage of a pingd value being 
 calculated when the constraints are applied (because pingd starts right 
 away and not just when the crm comes alive).

I see. My bad.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] DRBD and Pingd

2008-01-24 Thread Thomas Glanzmann
Hello Steve,
attached is a working example for a postgres cluster. Put your
filesystem, IP, and database resources in a resource group and drop the
colocation and order constraints; otherwise you have to define your order
rules in both directions. See also this thread:

http://article.gmane.org/gmane.linux.highavailability.user/21811

Thomas
use_logd yes
bcast eth1
mcast eth0.2 239.0.0.2 694 1 0
node postgres-01 postgres-02
respawn hacluster /usr/lib/heartbeat/dopd  
apiauth dopd uid=hacluster gid=haclient
watchdog /dev/watchdog
ping 172.17.0.254
crm on


postgres.xml
Description: XML document
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] DRBD and Pingd

2008-01-24 Thread Thomas Glanzmann
Hi Steve,
your cib.xml isn't working because you forgot to propagate the pingd
values: the pingd clone resource is missing from your cib.xml, so your
scores don't get propagated. A common mistake, I made it once myself.

Put this in the resources section of your CIB:

<clone id="pingd-clone">
    <meta_attributes id="pingd-clone-ma">
        <attributes>
            <nvpair id="pingd-clone-1" name="globally_unique" value="false"/>
        </attributes>
    </meta_attributes>

    <primitive id="pingd-child" provider="heartbeat" class="ocf" type="pingd">
        <operations>
            <op id="pingd-child-monitor" name="monitor" interval="20s" timeout="60s" prereq="nothing"/>
            <op id="pingd-child-start" name="start" prereq="nothing"/>
        </operations>
        <instance_attributes id="pingd_inst_attr">
            <attributes>
                <nvpair id="pingd-1" name="dampen" value="60s"/>
                <nvpair id="pingd-2" name="multiplier" value="100"/>
            </attributes>
        </instance_attributes>
    </primitive>
</clone>

Double-check that the value gets propagated:

(apache-01) [~] cibadmin -Q | grep 'name="pingd"' | grep value
 <nvpair id="status-f5707ca9-2673-4edb-80e6-d7700efbd7f3-pingd" name="pingd" value="100"/>
 <nvpair id="status-47923b94-150d-45d5-a7f4-01f1aa607484-pingd" name="pingd" value="100"/>
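
To actually act on the propagated value you also need a constraint that
refers to the pingd attribute. A rough, untested sketch of how I would add
one from the shell (the resource name my-group and all the ids are
placeholders; the not_defined/lte rule is the pattern described on the
pingd page):

cat > pingd-constraint.xml <<'END'
<rsc_location id="my-group-connected" rsc="my-group">
    <rule id="my-group-connected-rule" score="-INFINITY" boolean_op="or">
        <expression id="my-group-pingd-undefined" attribute="pingd" operation="not_defined"/>
        <expression id="my-group-pingd-zero" attribute="pingd" operation="lte" value="0"/>
    </rule>
</rsc_location>
END

cibadmin -C -o constraints -x pingd-constraint.xml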

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] what to do on loss of network

2008-01-24 Thread Thomas Glanzmann
Hello Kettunen,

 I have SLES10 SP1 HA 2.0.8 split site two node cluster and I've
 configured pingd clone resource to make resource location constrains. It
 works very well. My ping node is Iscsi server in third site from where
 cluster node mounts its resource disk. If I disconnect all communication
 paths between nodes active node correctly stops resource because it
 loses also ping node connection. But there is definetly split brain
 going on (both think they are DC).

I think that is okay. If you have a two node cluster and the two nodes
can't talk to each other for 90 seconds (or whatever the default timeout
is), both assume DC status. The only way around it is to configure
quorum, but to be honest I never found out how to do that. Maybe someone
could give me a walkthrough.

I shouldn't hit the split brain scenario, because I have two redundant
communication links between my nodes (switch and crossover cable); maybe
I will add a serial line, too. But for the time being it is more than
fine.

I also monitor my whole setup using nagios:

- drbd
- if a DC is chosen and all resources are online
- heartbeat links

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: FW: [Linux-HA] what to do on loss of network

2008-01-24 Thread Thomas Glanzmann
Hello Kettunen,

 Correction. I ment to say that splitbrain detection should be done
 when nodes see each other again (even at network level). CRM status
 messages do move when connection between nodes is back, but other node
 don't accept messages from other node.

I agree with you. They should stop their resources and renegotiate, or
the other way around. But to be honest, I have never tried such a
situation, though I easily could.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ordering constraints and node crash

2008-01-22 Thread Thomas Glanzmann
Hello Marc,

 If I kill the node hosting postgresr2, postgresr2 migrates to another
 node, but applisr1 and applisr3 aren't restarted. Is it normal ? What
 could I do to solve this ?

the answer to your question is 'resource group'. A resource group is a
container for resources. Every resource in a resource group is started
and stopped in order and they always have to run on the same host.

If you want to build the equivalent of a resource group yourself, you need
order and colocation constraints, and more of them than the obvious ones.
See this thread:

http://article.gmane.org/gmane.linux.highavailability.user/21811

Example (from my postgres database):

<group id="postgres-cluster">
    <primitive class="ocf" provider="heartbeat" type="Filesystem" id="fs0">
        <instance_attributes id="ia-fs0">
            <attributes>
                <nvpair id="ia-fs0-1" name="fstype" value="ext3"/>
                <nvpair name="directory" id="ia-fs0-2" value="/srv/postgres/"/>
                <nvpair id="ia-fs0-3" name="device" value="/dev/drbd0"/>
            </attributes>
        </instance_attributes>
        <operations>
            <op id="fs0-monitor0" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>
        </operations>
    </primitive>

    <primitive class="ocf" provider="heartbeat" type="IPaddr2" id="ip0">
        <instance_attributes id="ia-ip0">
            <attributes>
                <nvpair id="ia-ip0-1" name="ip" value="172.17.0.20"/>
                <nvpair id="ia-ip0-2" name="cidr_netmask" value="24"/>
                <nvpair id="ia-ip0-3" name="nic" value="eth0.2"/>
            </attributes>
        </instance_attributes>
        <operations>
            <op id="ip0-monitor0" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>
        </operations>
    </primitive>

    <primitive class="ocf" provider="heartbeat" type="pgsql" id="pgsql0">
        <instance_attributes id="ia-pgsql0">
            <attributes>
                <nvpair id="ia-pgsql0-1" name="pgctl" value="/usr/lib/postgresql/8.1/bin/pg_ctl"/>
                <nvpair id="ia-pgsql0-2" name="start_opt" value="--config_file=/srv/postgres/etc/postgresql.conf"/>
                <nvpair id="ia-pgsql0-3" name="pgdata" value="/srv/postgres/data/"/>
                <nvpair id="ia-pgsql0-4" name="logfile" value="/srv/postgres/postgresql.log"/>
            </attributes>
        </instance_attributes>
        <operations>
            <op id="pgsql0-monitor0" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>
            <op id="pgsql0-start0" name="start" timeout="120s" prereq="nothing"/>
        </operations>
    </primitive>
</group>

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HOWTO: Build a high available iscsi Target using heartbeat, drbd and ietd for ESX Server 3.5

2008-01-21 Thread Thomas Glanzmann
Hello Dejan,

 Nice effort. Thanks for sharing it. Perhaps you'd like to put
 this into the wiki.linux-ha.org. If you do, don't forget to
 pepper the doc with YMMV.

I am going to do that.

 1. cib is a bit too lean. There are no attributes set for the ietd
 resource.

Well, I have a default resource agent which I use for all kinds of
scenarios and just adapt. The only time I used attributes was when I had
to; that was openvpn (I have two instances running and they use different
config files). But I am going to fix that.

 2. ietd RA is Linux specific. If it has to be then you should check if
 it runs on Linux and if not bail out with an appropriate message.

It is Linux specific (at least to my knowledge). I will add that error
message.

 3. I understand that fixing various memory parameters is important for
 ietd performance, but that has no place in the RA.  Placing those
 settings in the XML info as comment should suffice.  The admins may
 choose different settings anyway.

Actually I don't have a clue. I just looked at the example init script
provided in the ietd distribution and adapted the information given
there.

 4. There are various modprobe statements. Is that necessary? It should
 be better to assume that ietd init script has been run and then just
 add/remove new targets using the RA.

I am not aware whether that is possible. Most of the time (apache, nfs
server) I work by shutting the whole thing down and starting it
elsewhere, even if it would be possible to do it on a per-LUN basis.

 5. Is this RA just an example?

Yes, a working example. But to be honest I have three or four of these
RAs in production.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] HOWTO: Build a high available iscsi Target using heartbeat, drbd and ietd for ESX Server 3.5

2008-01-21 Thread Thomas Glanzmann
Trent,

 I just did a very similar thing, except in my case I am using shared 
 storage (MD3000 - SAS) and theres a bit more fun to that part of it 
 (multipath, stonith, etc) - also  I setup heartbeat in v1 mode not CRM 
 mode.

nice, I never had an MD3000 in my hands.

 I plan to post a walkthrough at some point in the future (I also setup
 SMB and NFS based storage) if anyone is interested.

Well, I am certainly interested. I plan to provide some education on
linux-ha myself, because for me it wasn't that easy to understand the
existing resources.

 Interesting about the ScsiSN thing - I didnt see that readme file and
 what I did to solve that problem was use a different Lun number on
 each target..  but this solution makes much more sense - thanks for
 the tip.

I would always use different LUN numbers, too. However in the future I
am going to use different LUN numbers and different ScsiSNs.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] OCF test script (ocf-tester)

2008-01-21 Thread Thomas Glanzmann
Hello Jeff,

 Please find attached the Nagios OCF script I wrote.

thank you for sharing.

 monitor_nagios(){
 case ${NAGIOSRUNNING} in
 yes)
 if [ -f ${OCF_RESKEY_pid} ]; then
 echo ${0} MONITOR: running
 exit 0
 fi
 ;;
 no)
 if [ -f ${OCF_RESKEY_pid} ]; then
 echo ${0} MONITOR: failed
 exit 7
 else
 echo ${0} MONITOR: stopped
 exit 7
 fi
 ;;
 *)
 echo ${0} MONITOR: unknown status
 exit 1
 ;;
 esac
 }

something that caught my eye: the monitor action should return

0 if the resource is running
7 if it is stopped
anything else if it has failed

Source: http://www.linux-ha.org/OCFResourceAgent

But your resource agent returns 7 when it has failed.
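
A minimal sketch of how that case statement could look so the exit codes
follow the convention above (same variable names as in your script,
otherwise untested):

monitor_nagios(){
        case ${NAGIOSRUNNING} in
        yes)
                if [ -f "${OCF_RESKEY_pid}" ]; then
                        echo "${0} MONITOR: running"
                        exit 0          # running
                fi
                echo "${0} MONITOR: failed"
                exit 1                  # should be running but is not -> failed
                ;;
        no)
                if [ -f "${OCF_RESKEY_pid}" ]; then
                        echo "${0} MONITOR: failed"
                        exit 1          # stale pid file -> failed, not merely stopped
                fi
                echo "${0} MONITOR: stopped"
                exit 7                  # cleanly stopped
                ;;
        *)
                echo "${0} MONITOR: unknown status"
                exit 1
                ;;
        esac
}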

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] OCF test script (ocf-tester)

2008-01-19 Thread Thomas Glanzmann
Hello Jeff,

 I am attempting to write an OCF compliant script for nagios.  I have
 followed the documentation here:

I attached the one I am using. Keep me posted if you do something
different.

Thomas
#!/bin/bash

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

export PID=/var/run/nagios2/nagios2.pid
export CONFIGFILE=/etc/nagios2/nagios.cfg
export EXECUTABLE=/usr/sbin/nagios2

case "$1" in
start)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && exit 0;
        else
                rm -f ${PID}
        fi

        find /var/lib/nagios2/ -type f -print0 | xargs -0 rm
        ${EXECUTABLE} -d ${CONFIGFILE}
        ;;

stop)
        if [ -f ${PID} ]; then
                kill `cat ${PID}` > /dev/null
        fi

        rm -f ${PID}

        exit 0;
        ;;

status)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && exit 0;
        fi

        exit 1;
        ;;

monitor)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && exit 0;
        fi

        exit 7;
        ;;

meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="nagios">
<version>1.0</version>

<longdesc lang="en">
OCF Ressource Agent for Nagios.
</longdesc>

<shortdesc lang="en">OCF Ressource Agent for Nagios.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac
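
For what it's worth, once the agent is installed you can exercise it
outside the cluster with ocf-tester; roughly like this (the path and the
resource name are just examples, and I am quoting the options from
memory):

# exercise start/stop/monitor/meta-data of the agent standalone
ocf-tester -n nagios /usr/lib/ocf/resource.d/heartbeat/nagios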
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-18 Thread Thomas Glanzmann
Hello,

 You don't need location constraints.

okay. Could you elaborate, please?

Does the stonith subsystem automatically know where to put them?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-18 Thread Thomas Glanzmann
Hello Lars,

 If the node fails, and the other side needs STONITH, the resource will
 be started in that partition automatically.  The location constraints
 don't hurt, but you don't need them.  STONITH resources get started
 before any STONITH operation is performed, which has roughly the same
 effect. 

I see. Okay, then I will remove them. I only added them because I did not
understand what happens under the hood and was only scratching the
surface. :-)

 And yes, on a stop failure, a node might decide to fence itself too. As
 stonithd is network aware, it doesn't matter where exactly in the
 cluster the STONITH resource runs.

Good point. During my dozens of test setups at the beginning I did in
fact have heartbeat instances that locked themselves up on a 'reboot',
but I have never seen this problem again.

Thanks for the elaboration on this topic. I hope that I have a good
stonith implementation, but I have thought about making the stonith
device highly available itself:

   # # # # # # # # # # # # # #
  /# Switch 01 #-# Switch 02 #\
 / # # # # # # # # # # # # # # \
 | | | |
 | # # # # # # # # # # # # # # |
 | # Stonith 1 #\   /# Stonith 2 # |
 | # # # # # # # \ / # # # # # # # |
 | |  /  | |
 \ # # # # # # # / \ # # # # # # # /
  \# apache 01 #/   \# apache 02 #/
   # # # # # # # # # # # # # #

A stonith device would look like this: an Atmel + Ethernet controller + a
few optocouplers. It would receive a broadcast or multicast UDP frame,
reset a component and send an acknowledgement back. On the heartbeat side
there would be an application written in C which opens a UDP socket,
sends the request and waits 5 seconds for an answer. If it receives one,
the stonith worked, otherwise not. A friend of mine could build the
hardware in 5 days from scratch and I could write the software in 2 hours
or so. Just a thought; let's see if it becomes reality. One stonith
device would have ~10 reset lines.
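
Just to make the idea concrete, the host side could initially be mocked
up with netcat instead of the small C program (the multicast address,
port and message format here are made up):

#!/bin/sh
# send a reset request and wait up to 5 seconds for an acknowledgement
NODE=$1
ACK=$(echo "RESET ${NODE}" | nc -u -w 5 239.0.0.9 6666)
if [ "${ACK}" = "ACK ${NODE}" ]; then
        echo "stonith of ${NODE} confirmed"
        exit 0
fi
echo "no acknowledgement, stonith of ${NODE} failed"
exit 1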

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-18 Thread Thomas Glanzmann
Lars,

 Assuming that the fencing device can be reached from all nodes, it
 doesn't matter where they are put. Only if you have, say, a serial power
 switch which is only reachable from one node do you need location
 constraints.

I have a two node cluster. I use external/ipmi which needs one instance
per node. A node that is misbehaving can't stonith itself, can it?

Is linux-ha so smart to see that the one stonith resource has to run on
the one node and the other on the other node?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-18 Thread Thomas Glanzmann
Hello Dejan,

 http://developerbugs.linux-foundation.org/show_bug.cgi?id=1752

 According to this, it does matter. There really is a check in
 stonithd which prevents a node to stonith itself.

 So, I'd say that there should be a location constraint which says
 not to run a stonith resource on the same node which is to be
 fenced by that stonith resource. Otherwise, the stonith resource
 is going to be started, but it won't do its job should the need
 arise.

Thank you a lot for the clarification. So I let my setup as it is.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Supervise but don't stop a resource

2008-01-18 Thread Thomas Glanzmann
Hello,
is it possible with linux-ha to supervice (monitor) a resource and start
it when it failed, but do not stop it when heartbeat is stopped?

I am thinking about the syslog daemon and sshd.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Colocations and orders

2008-01-17 Thread Thomas Glanzmann
Hello Jochen,

   [ RESEND: Previous CIB was crap ]

 ipaddr
 drbd
 filesystem (for mounting drbd)
 apache
 tomcat

find a cib.xml attached. I also attached two resource agents that I
wrote myself and run on Debian Etch. Adapt them to your needs (hostnames
and IP address; mountpoint; drbd resource). I hope that gets you going.

Oh, and some more information:

ha.cf:
use_logd yes
bcast eth1
mcast eth0.2 239.0.0.1 694 1 0
mcast eth0.3 239.0.0.1 694 1 0
node apache-01 apache-02
watchdog /dev/watchdog
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd uid=hacluster gid=haclient
ping 62.146.78.1
crm on

drbd.conf:
global {
usage-count no;
}

common {
syncer {
rate 100M;
}

handlers {
outdate-peer /usr/lib/heartbeat/drbd-peer-outdater;
}
}

resource gcl {
protocol C;

startup {
degr-wfc-timeout 120;
}

disk {
on-io-error pass_on;
fencing resource-only;
}

on apache-01 {
device /dev/drbd0;
disk   /dev/sda3;
address    172.17.0.1:7788;
meta-disk  internal;
}

on apache-02 {
device /dev/drbd0;
disk   /dev/sda3;
address    172.17.0.2:7788;
meta-disk  internal;
}
}

/var/cfengine/inputs/update:
...
if [ -x /sbin/drbdsetup ]; then
chown root:haclient /sbin/drbdsetup /sbin/drbdmeta
chmod 750 /sbin/drbdsetup /sbin/drbdmeta
chmod u+s /sbin/drbdsetup /sbin/drbdmeta
fi

Thomas
#!/bin/bash

# . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

export PID=/var/run/apache2.pid
export EXECUTABLE=/usr/sbin/apache2ctl

case "$1" in
start)
        ${EXECUTABLE} start && exit || exit 1;
        ;;

stop)
        ${EXECUTABLE} stop && exit || exit 1;
        ;;

status)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && exit;
        fi

        exit 1;
        ;;

monitor)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && {
                        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost/ && exit || exit 1
                }
        fi

        exit 7;
        ;;

meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="apachetg">
<version>1.0</version>

<longdesc lang="en">
OCF Ressource Agent for Apache.
</longdesc>

<shortdesc lang="en">OCF Ressource Agent for Apache.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac
#!/bin/sh

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# OCF Ressource Agent on top of tomcat init script shipped with debian. #
#  Thomas Glanzmann --tg 21:22 07-12-30 #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

#   This script manages a Heartbeat Tomcat instance
#   usage: $0 {start|stop|status|monitor|meta-data}
#   OCF exit codes are defined via ocf-shellfuncs

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

case "$1" in
start)
        /etc/init.d/tomcat5.5 start > /dev/null 2>&1 && exit || exit 1
        ;;

stop)
        /etc/init.d/tomcat5.5 stop > /dev/null 2>&1 && exit || exit 1
        ;;

status)
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 && exit || exit 1
        ;;

monitor)
        # Check if Ressource is stopped
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 || exit 7

        # Otherwise check services (XXX: Maybe loosen retry / timeout)
        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:8180/eccar/ && exit || exit 1
        ;;

meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tomcattg">
<version>1.0</version>

<longdesc lang="en">
OCF Ressource Agent on top of tomcat init script shipped with debian.
</longdesc>

<shortdesc lang="en">OCF Ressource Agent on top of tomcat init script shipped with debian.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac

Re: [Linux-HA] Colocations and orders

2008-01-17 Thread Thomas Glanzmann
Hello Jochen,

 ipaddr
 drbd
 filesystem (for mounting drbd)
 apache
 tomcat

find a cib.xml attached. I also attached two resource agents that I
wrote myself and run on Debian Etch. Adapt them to your needs (hostnames
and IP address). I hope that gets you going.

Thomas


ub-freiburg.xml
Description: XML document
#!/bin/bash

# . ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

export PID=/var/run/apache2.pid
export EXECUTABLE=/usr/sbin/apache2ctl

case "$1" in
start)
        ${EXECUTABLE} start && exit || exit 1;
        ;;

stop)
        ${EXECUTABLE} stop && exit || exit 1;
        ;;

status)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && exit;
        fi

        exit 1;
        ;;

monitor)
        if [ -f ${PID} ]; then
                kill -0 `cat ${PID}` > /dev/null && {
                        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost/ && exit || exit 1
                }
        fi

        exit 7;
        ;;

meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="apachetg">
<version>1.0</version>

<longdesc lang="en">
OCF Ressource Agent for Apache.
</longdesc>

<shortdesc lang="en">OCF Ressource Agent for Apache.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac
#!/bin/sh

# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #
# OCF Ressource Agent on top of tomcat init script shipped with debian. #
#  Thomas Glanzmann --tg 21:22 07-12-30 #
# # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # # #

#   This script manages a Heartbeat Tomcat instance
#   usage: $0 {start|stop|status|monitor|meta-data}
#   OCF exit codes are defined via ocf-shellfuncs

. ${OCF_ROOT}/resource.d/heartbeat/.ocf-shellfuncs

case "$1" in
start)
        /etc/init.d/tomcat5.5 start > /dev/null 2>&1 && exit || exit 1
        ;;

stop)
        /etc/init.d/tomcat5.5 stop > /dev/null 2>&1 && exit || exit 1
        ;;

status)
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 && exit || exit 1
        ;;

monitor)
        # Check if Ressource is stopped
        /etc/init.d/tomcat5.5 status > /dev/null 2>&1 || exit 7

        # Otherwise check services (XXX: Maybe loosen retry / timeout)
        wget -o /dev/null -O /dev/null -T 1 -t 1 http://localhost:8180/eccar/ && exit || exit 1
        ;;

meta-data)
        cat <<END
<?xml version="1.0"?>
<!DOCTYPE resource-agent SYSTEM "ra-api-1.dtd">
<resource-agent name="tomcattg">
<version>1.0</version>

<longdesc lang="en">
OCF Ressource Agent on top of tomcat init script shipped with debian.
</longdesc>

<shortdesc lang="en">OCF Ressource Agent on top of tomcat init script shipped with debian.</shortdesc>

<actions>
<action name="start" timeout="90" />
<action name="stop" timeout="100" />
<action name="status" timeout="60" />
<action name="monitor" depth="0" timeout="30s" interval="10s" start-delay="10s" />
<action name="meta-data" timeout="5s" />
<action name="validate-all" timeout="20s" />
</actions>
</resource-agent>
END
;;
esac
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] external/ipmi example configuration

2008-01-16 Thread Thomas Glanzmann
Hello,
the previous external/ipmi configuration worked, but I don't know why.
However, here is one that seems to follow standard practice:

<resources>
    <primitive id="postgres-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
        <operations>
            <op id="postgres-01-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
            <op id="postgres-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
        </operations>

        <instance_attributes id="postgres-01-fencing-ia">
            <attributes>
                <nvpair id="postgres-01-fencing-hostname" name="hostname" value="postgres-01"/>
                <nvpair id="postgres-01-fencing-ipaddr" name="ipaddr" value="172.18.0.121"/>
                <nvpair id="postgres-01-fencing-userid" name="userid" value="Administrator"/>
                <nvpair id="postgres-01-fencing-passwd" name="passwd" value="password"/>
            </attributes>
        </instance_attributes>
    </primitive>

    <primitive id="postgres-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
        <operations>
            <op id="postgres-02-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
            <op id="postgres-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
        </operations>

        <instance_attributes id="postgres-02-fencing-ia">
            <attributes>
                <nvpair id="postgres-02-fencing-hostname" name="hostname" value="postgres-02"/>
                <nvpair id="postgres-02-fencing-ipaddr" name="ipaddr" value="172.18.0.122"/>
                <nvpair id="postgres-02-fencing-userid" name="userid" value="Administrator"/>
                <nvpair id="postgres-02-fencing-passwd" name="passwd" value="password"/>
            </attributes>
        </instance_attributes>
    </primitive>
</resources>

<constraints>
    <rsc_location id="postgres-01-fencing-placement" rsc="postgres-01-fencing">
        <rule id="postgres-01-fencing-placement-rule-1" score="-INFINITY">
            <expression id="postgres-01-fencing-placement-exp-02" value="postgres-02" attribute="#uname" operation="ne"/>
        </rule>
    </rsc_location>

    <rsc_location id="postgres-02-fencing-placement" rsc="postgres-02-fencing">
        <rule id="postgres-02-fencing-placement-rule-1" score="-INFINITY">
            <expression id="postgres-02-fencing-placement-exp-02" value="postgres-01" attribute="#uname" operation="ne"/>
        </rule>
    </rsc_location>
</constraints>


Last updated: Wed Jan 16 14:35:50 2008
Current DC: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef)
2 Nodes configured.
4 Resources configured.


Node: postgres-02 (211523e0-a549-49b7-bf29-f646915698ef): online
Node: postgres-01 (24a3fa1b-6b62-470c-a6e1-4c1598875018): online

Full list of resources:

Master/Slave Set: ms-drbd0
drbd0:0 (heartbeat::ocf:drbd):  Master postgres-02
drbd0:1 (heartbeat::ocf:drbd):  Started postgres-01
Resource Group: postgres-cluster
fs0 (heartbeat::ocf:Filesystem):Started postgres-02
ip0 (heartbeat::ocf:IPaddr2):   Started postgres-02
pgsql0  (heartbeat::ocf:pgsql): Started postgres-02
postgres-01-fencing (stonith:external/ipmi):Started postgres-02
postgres-02-fencing (stonith:external/ipmi):Started postgres-01

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] detecting network isolation

2008-01-16 Thread Thomas Glanzmann
Hello,
I have a two-node test cluster. I added a ping statement to each of the
nodes to ping the default network. The two nodes are connected to the
same network segment and have a crosslink cable between them. When I
pull the cable of the node that is running the service, I see the
following in the logs, but the services are not migrated over to the
node that still has good connectivity:

Jan 17 05:50:56 ha-2 heartbeat: [4452]: WARN: node 10.0.0.1: is dead
Jan 17 05:50:56 ha-2 heartbeat: [4452]: info: Link 10.0.0.1:10.0.0.1 
dead.
Jan 17 05:50:56 ha-2 crmd: [4470]: notice: crmd_ha_status_callback: 
Status update: Node 10.0.0.1 now has status [dead]
Jan 17 05:50:56 ha-2 crmd: [4470]: WARN: get_uuid: Could not calculate 
UUID for 10.0.0.1

So what do I have to do to configure a 'failover on network isolation
scenario'?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] Automatic Clenaup of certain resources

2008-01-16 Thread Thomas Glanzmann
Hello,
I use Linux-HA to monitor some services on a dial-in machine, a so-called
single node cluster. Sometimes my dial-in connection, openvpn connection,
or IPv6 connectivity does not come up. Is there a way to tell Linux-HA to
retry a failed resource again after a certain amount of time?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] detecting network isolation

2008-01-16 Thread Thomas Glanzmann
Hello,

 Jan 17 05:50:56 ha-2 heartbeat: [4452]: WARN: node 10.0.0.1: is dead
 Jan 17 05:50:56 ha-2 heartbeat: [4452]: info: Link 10.0.0.1:10.0.0.1 
 dead.
 Jan 17 05:50:56 ha-2 crmd: [4470]: notice: crmd_ha_status_callback: 
 Status update: Node 10.0.0.1 now has status [dead]
 Jan 17 05:50:56 ha-2 crmd: [4470]: WARN: get_uuid: Could not 
 calculate UUID for 10.0.0.1

okay, now I've got it. You need pingd for that to work, and you need to
set scores as described on this page:

http://www.linux-ha.org/pingd
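
For completeness, I would start pingd from ha.cf with a respawn line
roughly like the following (a sketch from memory, so double-check the
exact options against the page above):

# ha.cf: spawn pingd at startup; it writes its score into the "pingd" attribute
respawn hacluster /usr/lib/heartbeat/pingd -m 100 -d 5s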

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Restart a Resource controlled by Heartbeat

2008-01-14 Thread Thomas Glanzmann
Hello Boroczki,

 I'd rather use kill -HUP `pidof nagios` (or something similar) to reload the
 configuration of nagios.

this is what I ended up doing.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-14 Thread Thomas Glanzmann
Lars,

 Yes. You have more than one primitive within the clone, which doesn't
 work.

 Why do you do that?

Because there is no documentation, the maintainer doesn't answer e-mail,
and this was the only example that I found in the archives. It seemed to
work, but I guess I was just lucky.

 You could either clone a group, or just not clone the two; it's not
 needed.

So could you please say that in plain xml? I still don't get it. Is that
what you have in mind?

- Don't use clone or group
- One primitive per ipmi device
- Location constraints

<primitive id="postgres-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
    <operations>
        <op id="postgres-01-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
        <op id="postgres-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
    </operations>

    <instance_attributes>
        <attributes>
            <nvpair id="postgres-01-fencing-hostname" name="hostname" value="postgres-01"/>
            <nvpair id="postgres-01-fencing-ipaddr" name="ipaddr" value="172.18.0.121"/>
            <nvpair id="postgres-01-fencing-userid" name="userid" value="Administrator"/>
            <nvpair id="postgres-01-fencing-passwd" name="passwd" value="password"/>
        </attributes>
    </instance_attributes>
</primitive>

<primitive id="postgres-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
    <operations>
        <op id="postgres-02-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
        <op id="postgres-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
    </operations>

    <instance_attributes>
        <attributes>
            <nvpair id="postgres-02-fencing-hostname" name="hostname" value="postgres-02"/>
            <nvpair id="postgres-02-fencing-ipaddr" name="ipaddr" value="172.18.0.122"/>
            <nvpair id="postgres-02-fencing-userid" name="userid" value="Administrator"/>
            <nvpair id="postgres-02-fencing-passwd" name="passwd" value="password"/>
        </attributes>
    </instance_attributes>
</primitive>

<constraints>
    <rsc_location id="postgres-01-fencing-placement" rsc="postgres-01-fencing">
        <rule id="postgres-01-fencing-placement-1" score="INFINITY">
            <expression attribute="#uname" operation="eq" value="postgres-02"/>
        </rule>
    </rsc_location>
    <rsc_location id="postgres-02-fencing-placement" rsc="postgres-02-fencing">
        <rule id="postgres-02-fencing-placement-2" score="INFINITY">
            <expression attribute="#uname" operation="eq" value="postgres-01"/>
        </rule>
    </rsc_location>
</constraints>

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Running Linux-HA on a single node cluster

2008-01-14 Thread Thomas Glanzmann
Hello,

 I have 9 machines configured as 6 clusters:
 ~

and I can't count: I also have a ninth server that does SMTP, but it will
soon go away and become an HA resource.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Running Linux-HA on a single node cluster

2008-01-14 Thread Thomas Glanzmann
Hello Andrew,

 looks sane enough - though linux-ha is slightly heavy for just
 monitoring processes in a cluster-of-one.

 any reason not to make it a four node cluster?

I have 9 machines configured as 6 clusters:

- 2x apache (ha resources: router; openvpn; nagios; apache +
  mod jk; drbd + nfs; fencing)

- 4x tomcat (ha resources: tomcat; one cluster per node)

- 2x postgres (ha resources: drbd + postgres)

I decided to keep them as different clusters so that they don't
interfere with each other.

If I turned the tomcats into a four node cluster, what would that look
like in XML? Four tomcats, each with a strong affinity to one node?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-14 Thread Thomas Glanzmann
Hello Andrew,

 does that help?

yes, it does. I have a test cluster. I will write a pseudo plugin or use
the ssh one to simulate the behaviour and come back to you when I have
something to work with. I am still not sure how it works, but maybe I
should simply start reading the source code.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] ERROR: clone_unpack: fencing has too many children. Only the first (apache-01-fencing) will be cloned.

2008-01-13 Thread Thomas Glanzmann
Hello,
could someone tell me what is wrong with that fencing configuration:

Jan 13 11:38:48 apache-02 pengine: [13769]: ERROR: clone_unpack: fencing has 
too many children.  Only the first (apache-01-fencing) will be cloned.
Jan 13 11:38:48 apache-02 pengine: [13769]: info: process_pe_message: 
Configuration ERRORs found during PE processing.  Please run crm_verify -L to 
identify issues.
(apache-01) [/var/adm/syslog/2008/01/13] / crm_verify -L -V
crm_verify[5917]: 2008/01/13_11:44:57 ERROR: clone_unpack: fencing has too many 
children.  Only the first (apache-01-fencing) will be cloned.
crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with 
equal score (+INFINITY) for running the listed resources (chose apache-02):
crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with 
equal score (+INFINITY) for running the listed resources (chose apache-01):
crm_verify[5917]: 2008/01/13_11:44:57 WARN: native_assign_node: 2 nodes with 
equal score (+INFINITY) for running the listed resources (chose apache-01):
Errors found during check: config not valid

Here is my current fencing configuration. Everything seems to work,
though. I use version 2.1.3:

<configuration>
    <crm_config>
        <cluster_property_set id="cib-bootstrap-options">
            <attributes>
                <nvpair name="stonith-enabled" value="true" id="stonith-enabled"/>
                <nvpair name="stonith-action" value="reboot" id="stonith-action"/>
            </attributes>
        </cluster_property_set>
    </crm_config>

    <resources>
        <clone id="fencing">
            <instance_attributes id="ia-fencing-01">
                <attributes>
                    <nvpair id="fencing-01" name="clone_max" value="2"/>
                    <nvpair id="fencing-02" name="clone_node_max" value="1"/>
                </attributes>
            </instance_attributes>

            <primitive id="apache-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
                <operations>
                    <op id="apache-01-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
                    <op id="apache-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
                </operations>

                <instance_attributes id="ia-apache-01-fencing">
                    <attributes>
                        <nvpair id="apache-01-fencing-hostname" name="hostname" value="apache-01"/>
                        <nvpair id="apache-01-fencing-ipaddr" name="ipaddr" value="172.18.0.101"/>
                        <nvpair id="apache-01-fencing-userid" name="userid" value="Administrator"/>
                        <nvpair id="apache-01-fencing-passwd" name="passwd" value="password"/>
                    </attributes>
                </instance_attributes>
            </primitive>

            <primitive id="apache-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
                <operations>
                    <op id="apache-02-fencing-monitor" name="monitor" interval="60s" timeout="20s" prereq="nothing"/>
                    <op id="apache-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
                </operations>

                <instance_attributes id="ia-apache-02-fencing">
                    <attributes>
                        <nvpair id="apache-02-fencing-hostname" name="hostname" value="apache-02"/>
                        <nvpair id="apache-02-fencing-ipaddr" name="ipaddr" value="172.18.0.102"/>
                        <nvpair id="apache-02-fencing-userid" name="userid" value="Administrator"/>
                        <nvpair id="apache-02-fencing-passwd" name="passwd" value="password"/>
                    </attributes>
                </instance_attributes>
            </primitive>
        </clone>

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] MailTo Resource specified wrong?

2008-01-11 Thread Thomas Glanzmann
Hello Kirby,

 WARNING: Don't stat/monitor me! MailTo is a pseudo resource agent, so
 the status reported may be incorrect

 I guess if I had to guess, I'd probably delete the 'MailTo_6_mon'
 line...  But I don't know if that'll affect the mail I get when
 heartbeat switches things around

If you have a look at the resource agent:

/usr/lib/ocf/resource.d/heartbeat/MailTo

You see that the status/monitor section is not needed to notify you. You
may delete that monitor operation. However if I look at the MailTo RA
that comes with version 2.1.3 I see that the warning message is
commented out:

MailToStatus () {
#   ocf_log warn Don't stat/monitor me! MailTo is a pseudo resource agent, 
so the status reported may be incorrect

if ha_pseudo_resource MailTo_${OCF_RESOURCE_INSTANCE} monitor
then
echo running
return $OCF_SUCCESS
else
echo stopped
return $OCF_NOT_RUNNING
fi
}

So you have more than one choice (the first one is the best):

- Update to a recent version (2.1.3)

- Remove the operations section, including the monitor statement, from
  your MailTo configuration

- Go to the resource agent and comment that warning out (as it is by
  default in more recent versions)

- Leave everything as it is and live with the warnings.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-10 Thread Thomas Glanzmann
Hello,

 honestly, i would not use this repository for my upgrades as - at
 least in the past - major changes have been introduced during the
 heartbeat 2.1.3 development. for example the constraints were heavily
 modified.

I wouldn't use it for production either. But my point still stands: this
repository is hard to use if there isn't a ready-to-go apt line to
work with. And for new users ... and I was a new user two weeks ago ...
it just gets in your way (first impressions count).

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Monitoring Apache (v2.0.8)

2008-01-09 Thread Thomas Glanzmann
Hello Alon,
I would update to 2.1.3 (I am not sure if that is your problem). And
make the interval for the monitor operation higher. At the moment it
seems to be scheduled each second.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello Michael,

 http://www.ultramonkey.org/download/heartbeat/2.1.3/

which Debian Release do you use?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello Michael,

 etch.
 debian_version is 4.0
 apt-get update and upgrade done.

the packages you are trying to use are for Debian Sid.

You can do one of the following things:

- put deb http://131.188.30.102/~sithglan/linux-ha-die2te/ ./
  into /etc/apt/sources.list and call apt-get update; apt-get install 
heartbeat

- Build the packages by yourself:

wget 
http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3-2.dsc
wget 
http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3-2.diff.gz
wget 
http://www.ultramonkey.org/download/heartbeat/2.1.3/debian-sid/heartbeat_2.1.3.orig.tar.gz
dpkg-source -x heartbeat_2.1.3-2.dsc
cd heartbeat-2.1.3/
fakeroot debian/rules binary

  And put them into a directory. I always use the following Makefile to
  create a debian package repository:

TARGET=.

all:
@touch Release
@apt-ftparchive packages $(TARGET) | tee $(TARGET)/Packages | gzip -c > $(TARGET)/Packages.gz

Remember: before the '@'s there are tabs, not spaces.
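
After that, building the index and using the repository looks roughly
like this (the URL is just an example for wherever you export the
directory):

# in the directory containing the .deb files
make

# on the clients, e.g. in /etc/apt/sources.list:
#   deb http://your.mirror.example/ha/ ./
apt-get update
apt-get install heartbeat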

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello,

 Is this a regularly updated repository with the heartbeat ldirectord
 packages (and only those packages)?

yes, it is. But in the future the path will be

deb http://131.188.30.102/~sithglan/ha/

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello,
by the way, the problem was that I built the packages on a machine that
had a sarge gnutls-dev installed. I upgraded the package and just rolled
it out on 9 machines; everything is up and running. :-)

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello Andrew,

 http://download.opensuse.org/repositories/server:/ha-clustering/

do you have an apt line to use that location? I tried to make something
up but failed.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] auto_failback off, but the resource group still fails back.

2008-01-09 Thread Thomas Glanzmann
Jason,
just to make sure that we're on the same page here:

- You have a two node cluster

- You have a resource that is running only on one node

- When you run the resource on node b, node a reports a failed monitor
  for that resource?

If that is the case, then something is horribly wrong, because the
monitor operation for a resource should only run on a node that is
currently running the resource.

What I thought before is the following:

You start your resource, heartbeat tries to start it on node a, and the
monitor that heartbeat runs right after the start attempt (to verify that
everything is all right) fails. So heartbeat decides to run the resource
on node b, calls monitor afterwards to verify, and everything is fine.
You end up with the failed monitor on node a reported in crm_mon -1 -r.

So might it be possible that your monitor action _always_ fails on node
a, even if the service is correctly started on that node? Start your
service (with heartbeat stopped on both machines), then call your
resource agent with the monitor argument and see whether it does the
right thing.
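
A minimal sketch of that manual check (the agent path and service name are
placeholders, adjust them to your resource):

/etc/init.d/heartbeat stop                          # on both nodes
/etc/init.d/your-service start                      # start the service by hand
export OCF_ROOT=/usr/lib/ocf                        # OCF agents expect this
/usr/lib/ocf/resource.d/heartbeat/your-agent monitor; echo $?
# 0 means running; for OCF agents 7 means stopped, anything else is an error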

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Re: problems with ha, drbd and filesystems resource

2008-01-09 Thread Thomas Glanzmann
Hello Stephan,

 No ideas about the problem?

I think the question was already answered by someone on the list.
Heartbeat doesn't support drbd-0.8 at the moment. E.g. you can run a
primary/secondary cluster but not a primary/primary cluster. So someone
who understands what he is doing has to adapt the drbd resource agent to
understand the primary/primary scenario.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] auto_failback off, but the resource group still fails back.

2008-01-09 Thread Thomas Glanzmann
Hello Jason,

 1) It nominates a node as DC (in this case, node2, though I've seen both)
 2) The 'failed actions' block get's these lines almost immediately:

   resource_samba_storage_monitor_0 (node=node2.domain.com,
 call=3, rc=9): Error
   resource_samba_storage_monitor_0 (node=node1.domain.com,
 call=3, rc=9): Error

 3) then the resource group starts in order on node2. (IP, then
 storage, then daemon)

strange. I have the picture now, but I am still unsure where it is
coming from. Can you update to version 2.1.3 and try again with the
exact same configuration? Could you also send me the syslog of the daemon
facility of one or both nodes (if possible only from a complete restart,
i.e. heartbeat stopped on both nodes and then started on both, until the
problem pops up)?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] debian and heartbeat

2008-01-09 Thread Thomas Glanzmann
Hello,

 http://download.opensuse.org/repositories/server:/ha-clustering/Debian_Etch/Packages
  

 the debian folks are good, but not quite that good..see:
 http://ccrma.stanford.edu/planetccrma/man/man5/sources.list.5.html
 for details on how  to setup a custom apt source.

I read the manpage. I am still looking for a line that I can put into my
/etc/apt/sources.list. Does someone have such a line? Does someone use
that repository? If so, could that person be so kind as to simply post
that apt line and maybe publish it on the Download Heartbeat page? Then,
if a new heartbeat user comes along, he just adds that line to his
/etc/apt/sources.list, types

apt-get update
apt-get install heartbeat

and has an up-to-date heartbeat package.
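
To be concrete, by apt line I mean something of this shape, with the
correct path filled in (this is just the shape; I could not work out the
real path):

deb http://download.opensuse.org/repositories/server:/ha-clustering/<path> ./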

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Can I use different interfaces in different nodes?

2008-01-09 Thread Thomas Glanzmann
Hello,

 I want to setup a two nodes httpd cluster with heartbeat, and the
 configuration listed below:

that shouldn't be a problem, just adapt the ha.cf on each node to
reflect the network card configuration.

 And one more question, can I use bcast in VLAN environment?

You can. I have it running on one:

(postgres-01) [~] cat /etc/ha.d/ha.cf
use_logd yes
bcast eth1
mcast eth0.2 239.0.0.2 694 1 0
node postgres-01 postgres-02
respawn hacluster /usr/lib/heartbeat/dopd
apiauth dopd uid=hacluster gid=haclient
crm on

eth0.2 is a tagged VLAN interface which I address using a multicast
statement, but broadcast is also possible. I use multicast because I have
6 heartbeat clusters on that subnet.
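
To come back to the different-interfaces question, a minimal sketch
(interface names are made up): the ha.cf on the first node could carry

bcast eth1

while the second node uses

bcast eth2

as long as both interfaces sit on the same broadcast domain; the rest of
ha.cf stays identical on both nodes.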

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] hb2: making xml manageable

2008-01-08 Thread Thomas Glanzmann
Hello,

 1. Without  restarting or shutting down the cluster, and without
 editing the cib.xml file how can I make a change to the cluster
 configuration (i.e. how can I use haresources2cib.py to generate an
 updated cib.xml and get the cluster to use it without a restart)

I use 8-space indenting; with that it gets very readable. You can use
cibadmin -Q to dump the configuration. Kick out the cib tags and the
status section and you have a template to work with. Make sure that you
have a unique identifier specified for each section.

I use cibadmin -U -x /path/to/file.xml to update my configuration, which
works quite well as long as you don't put resources into a resource group
that were previously outside of it, or vice versa.

If you want a fresh start, you can always bring all cluster nodes down,
call rm /var/lib/heartbeat/crm/*, bring them up again and call the above
command to get things going again.
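
As a compact sketch of that dump, edit, update cycle (the file name is
arbitrary):

cibadmin -Q > cluster.xml     # dump the current CIB
# edit cluster.xml: strip the cib wrapper and the status section,
# make sure every element has a unique id
cibadmin -U -x cluster.xml    # push the edited configuration back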

 2. How can I override the defaults (such as timeout) in the resulting
 cib.xml file?

You just add a parameter and call cibadmin -U -x /path/to/file.
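
For example, overriding the default timeout on a monitor operation might
look like this (the id and the values are made up):

<op id="example-monitor" name="monitor" interval="60s" timeout="120s"/>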

Here is my cheat sheet you may find it helpful:

 # linux ha

 # Status:
crm_mon -1 -r

 # Dump XML Tree
cibadmin -Q

 # Add a single resource
cibadmin -o resources -C -x 01_drbd

 # Update a single resource
cibadmin -o resources -U -x 02_filesystem

 # Add a single constraint
cibadmin -o constraints -C -x 03_constraint_run_on

 # Update using an input produced from 'cibadmin -Q', minus the cib tags
 # and without the status section
cibadmin -U -x postgres.xml

 # Use the cluster default to determine if a resource should get started
crm_resource -r ms-drbd0 -v '#default' --meta -p target_role

 # Migrate a Resource to a host:
crm_resource -M -r postgres-cluster -H postgres-01

 # A nice man page with many examples at the end
man crm_resource

 # Check if a node is in standby
crm_standby -G -U postgres-01
crm_mon -1 -r

 # Put node into standby mode
crm_standby -U postgres-01 -v on

 # Make node active again (the two commands have the same effect)
crm_standby -U postgres-01 -v off
crm_standby -D -U postgres-01

 # Cleanup (Retry to Start after manual intervention) Resource
crm_resource -C -r tomcat-02 -H tomcat-02

 # Remove a statement by id if it happens to be in there twice (should be
 # fixed upstream)
cibadmin -o resources -D -X '<op id="0a71bc1a-b460-49bb-9d0d-2fe3ada169b9" name="monitor" interval="60s" timeout="120s" start_delay="1m"/>'

http://fghaas.wordpress.com/2007/10/04/checking-your-secondarys-integrity/

 # Reload CIB completely
 # - Kill the CIB and STATUS tags, including their content
 # Wipe old content (Attention: don't do that in production, it takes your
 # service down: cibadmin -E)
cibadmin -U -x /path/to/profile.xml

http://www.mail-archive.com/linux-ha@lists.linux-ha.org/msg03187.html
http://www.linux-ha.org/HaNFS
http://www.linux-ha.org/DRBD/NFS
http://www.linux-ha.org/DRBD/HowTov2

drbdsetup /dev/drbd0 primary -o

 # Split Brain Manual Recovery
 # Primary Node:
drbdadm connect all

 # Attach Split Brain Node as Secondary
drbdadm -- --discard-my-data connect

 # Force Split Brain Secondary to be Primary:
drbdadm -- --overwrite-data-of-peer primary all

 # Reload Configuration
/etc/init.d/heartbeat reload
drbdadm adjust all

 # Heartbeat broadcast link went missing (Attention: all services get
 # stopped on the node)
/etc/init.d/heartbeat restart

http://blogs.linbit.com/florian/2007/10/01/an-underrated-cluster-admins-companion-dopd/

 # List configured OCF agents and do operations on them (backend command)
/usr/lib/heartbeat/lrmadmin

 # XML Template generator:
python /usr/lib/heartbeat/crm_primitive.py

 # Delete Cluster Configuration (Attention: Not for production. The command has
 # to be issued on all nodes and all ha services must be stopped on all nodes
 # in the cluster)
rm /var/lib/heartbeat/crm/*

I attached an example postgres cluster configuration.

Thomas


postgres.xml
Description: XML document
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] Howto list all available agents and there possible attributes

2008-01-08 Thread Thomas Glanzmann
Hello Simon,

 I double checked and 2.1.3-2 does include both
 /usr/lib/stonith/plugins/external/ipmi and /usr/sbin/ciblint

I can confirm this. I used your diff, dsc and orig file to build a
package for Debian Etch (4.0). I am going to roll out the version
tonight on my production cluster (9 nodes). Thanks for fixing the
issues.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] coding bugfix for lib/plugins/stonith/ipmilan.c

2008-01-08 Thread Thomas Glanzmann
Hello Dejan,

 Configuration is comparable to the external/ipmi. Just check the
 parameter names and adjust the stonith type.

I see. So there is no need to touch ha.cf? Just add the ipmilan thing to
the cib.xml and that's it?

 Thomas, if you could also do additional testing, that'd be great.

I will, but the thing is that I have to test it in a production
environment, so I have to take it slow. I am happy that my production
system does what it is supposed to do right now. :-)

But on the weekends I can take two of four tomcats offline and give
ipmilan a try.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] auto_failback off, but the resource group still fails back.

2008-01-08 Thread Thomas Glanzmann
Hello Jason,

 1) For the monitor action, I might suggest the docs be updated
 slightly.  According to http://www.linux-ha.org/OCFResourceAgent, 0
 for 'running', 7 for 'stopped', and anything else is valid, but
 indicates an error.

 I have modified my script to only return '1' on error.  However, the
 same issue persists (Error in resource_samba_storage_monitor_0).

can you check by running it manually? Your resource agent doesn't seem to
need any arguments, so you should be able to do that:

- Stop heartbeat on node2

- Wait a while until you see the state again

- Call: /path/to/resource_samba_storage monitor && echo Okay || echo Failed

I am pretty sure that this gives you a Failed. Then track it down and
make sure that it returns Okay.
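
If you want to see the exact return code instead of just Okay/Failed (a
small sketch, same path as above):

/path/to/resource_samba_storage monitor
echo $?    # 0 means running; for OCF agents 7 means stopped, anything else is an error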

 2) crm_resource -C does not seem to have any effect.  I could not be
 looking in the right place though.

Make sure that you didn't misspell the name of the resource or the name
of the host. If you have your service running on node3 and call the
cleanup there, the failed monitor action has to vanish from the output of
crm_mon -1 -r.

 3) The cibadmin -Q results are attached.

Thanks. Looks fine to me. What you could always try is the following; it
wipes all states for sure:

- Stop heartbeat on both nodes
- Call rm /var/lib/heartbeat/crm/* on both nodes (that wipes
  your linux ha config including any logged states)

- Start heartbeat on both nodes

- Call cibadmin -U -X '
<configuration>
  <resources>
    <group id="group_samba">
      <primitive id="resource_samba_ip" class="ocf" type="IPaddr" provider="heartbeat">
        <instance_attributes id="resource_samba_ip_instance_attrs">
          <attributes>
            <nvpair id="a572b727-1aaf-43b5-b2c3-826305b6d533" name="ip" value="10.31.11.114"/>
            <nvpair id="3669be20-7782-432e-9504-ba65668186ca" name="nic" value="eth0"/>
            <nvpair id="14fbbd65-3019-43e9-8eb5-d9fb88deced2" name="cidr_netmask" value="255.255.255.0"/>
            <nvpair id="941cbbab-490d-48e7-88a8-3d4d62ff79a5" name="broadcast" value="10.31.11.255"/>
            <nvpair id="e44afc64-42cb-470e-b418-9a981991ee02" name="iflabel" value="vtqfs"/>
          </attributes>
        </instance_attributes>
        <operations>
          <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-ip"/>
        </operations>
      </primitive>

      <primitive id="resource_samba_storage" class="lsb" type="hb-vxvol" provider="heartbeat">
        <operations>
          <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-storage"/>
        </operations>
      </primitive>

      <primitive id="resource_samba_daemon" class="lsb" type="hb-samba" provider="heartbeat">
        <operations>
          <op name="monitor" interval="60s" timeout="120s" start_delay="1m" id="monitor-samba-daemon"/>
        </operations>
      </primitive>
    </group>
  </resources>
</configuration>
'

Oh, and what I don't get at all is the following: in your CIB you did not
have a monitor action defined at all. At least with 2.1.3 that means your
resource isn't monitored at all, with one exception: right after it is
started. Could it be possible that your samba_storage RA forks off another
process in the background and returns immediately? If that is the case,
adapt the RA so it doesn't do that. It should only return when the service
is running, and running means it is in a state in which monitor returns 0.
The Postgres RA is a good example for this, I guess. I added monitor
operations to your configuration.

Nevertheless you should upgrade to 2.1.3 as soon as possible.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover iscsi SAN

2008-01-07 Thread Thomas Glanzmann
Hello Michael,
could you please send me your ocf resource agent for ietd and the Output
of cibadmin -Q without the status section. That is because I want to
do such a setup by myself. Have you tested and initiators with that
setup. I would like to use it with ESX Server Version 3.5. And would
like to know if the ESX 3.5 server works when switching the service.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Howto list all available agents and there possible attributes

2008-01-07 Thread Thomas Glanzmann
Hello Andrew,

 I believe it was considered too broken to continue shipping.  None of
 us have the required hardware to test/fix/maintain the relevant code.

I think you are mistaken. The external/ipmi plugin works out of the box
and perfectly fine, at least for me. Only the documentation is missing,
but once you get the idea of how to configure it, it is straightforward.

When you use the debian/rules file which is shipped with 2.1.3, ciblint
(or whatever the tool is called) is missing and the version number is
wrong, but the external/ipmi plugin is packaged. When I use the dsc file
from the website, the first two things are okay but external/ipmi is
missing.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Failover iscsi SAN

2008-01-07 Thread Thomas Glanzmann
Hello Niels,

 My personal experience with ietd is that it really doesn't like to be
 stopped if it is in use. (i.e. kernel panics, kernel hangs etc.). I
 would do some carefull testing before trying to use this in a
 heartbeat environment.

I just want a proof of concept, not a production system. But I also saw
ietd panicking the ietd host system, though only when I tried to access a
block device using a raw device mapping from a VMware ESX 3.0.1 server.
That was reproducible. As soon as I have something that works I will
report back. I never had problems with starting and stopping, though.

Btw. did you compile ietd yourself, or did you use a distribution like
for example SLES10 (that one ships an ietd)?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


[Linux-HA] drbd + ocfs2

2008-01-07 Thread Thomas Glanzmann
Hello,
I would like to know how to set up a drbd + ocfs2 installation with two
masters. What OCF agent do I have to use for that? Does someone have a
working example configuration? I would like to use heartbeat-2.1.3.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Jan 2 23:25:01 postgres-02 tengine: [8736]: ERROR: te_graph_trigger: Transition failed: terminated

2008-01-07 Thread Thomas Glanzmann
Hello Andrew,

 Thanks - the PE is now smart enough to at least filter out the
 duplicates :-)

thanks a lot for getting rid of this annoying bug. :-)

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Howto list all available agents and there possible attributes

2008-01-07 Thread Thomas Glanzmann
Hello Simon,

 Thanks, I'll look into this. Though I was under the impression that
 the ipmi module was broken. Has it been fixed?

there are two ipmi modules:

- ipmilan (a C implementation that is not built by default)
- external/ipmi (a shell script)

The first one was indeed broken because it did not compile at the time
2.1.3 was released, but I saw a simple patch on the list (two brackets
were missing).

The shell implementation just works, at least in the tests I did (play
dead fish in the water and wait 10 seconds; the node gets power-cycled by
the other node).

And as you already found out, the Makefile in the external directory was
from an older revision. Sorry, I should have mentioned it because I had
already found the problem.

Anyway thanks for fixing this. I am going to pull the new ones and build
a package.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] coding bugfix for lib/plugins/stonith/ipmilan.c

2008-01-07 Thread Thomas Glanzmann
Hello,

 But it stops. If you have the machine with IPMI interface, could you
 test my patch?

do you have a configuration for me? I have a machine with IPMI, and the
external/ipmi stonith works for me. If you can walk me through
configuring ipmilan, I can give it a spin.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] problems with ha, drbd and filesystems resource

2008-01-06 Thread Thomas Glanzmann
Hello Stephan,
could you please attach your config?

cibadmin -Q and drop the status section?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Linux-HA Service Monitoring

2008-01-04 Thread Thomas Glanzmann
Hello Jayaprakash,

 I Place the new script in
 /usr/lib/ocf/resource.d/heartbeat/.ocf-shellfuncs and execute the
 following commands.

Hopefully you did not, and from the output I can tell that you didn't:
you put it where it belongs.

 If possible come to online, we discuss in detailed.. My id
 Yahoo/gmail/msn id :  jp.aspm

I don't do instant messaging. Except for E-Mail. :-)

 I'm using Fedora 7 and Heartbeat Version 2.1.2

I see.

Okay, the problem is that the Fedora 7 init script is horribly broken.
Could you please send me /etc/init.d/squid from your machine via e-mail?
Send it directly to me, otherwise people are going to scream at us. :-)

Then I will write you an OCF agent that doesn't rely on the init script
that is shipped with Fedora 7.
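
The skeleton of such an agent looks roughly like this; this is only a
minimal sketch with assumed paths and squid options, not the agent I will
send you:

#!/bin/sh
# Minimal OCF-style resource agent sketch for squid (paths are assumptions).
SQUID=/usr/sbin/squid
CONF=/etc/squid/squid.conf

case "$1" in
  start)
    $SQUID -f $CONF || exit 1              # 1 = OCF_ERR_GENERIC
    # only report success once monitor agrees the daemon is really up
    while ! "$0" monitor; do sleep 1; done
    exit 0                                 # 0 = OCF_SUCCESS
    ;;
  stop)
    $SQUID -f $CONF -k shutdown 2>/dev/null
    while "$0" monitor; do sleep 1; done
    exit 0
    ;;
  monitor|status)
    # -k check signals the running master process and fails if there is none
    $SQUID -f $CONF -k check 2>/dev/null && exit 0
    exit 7                                 # 7 = OCF_NOT_RUNNING
    ;;
  meta-data)
    # a real agent prints its XML description here
    exit 0
    ;;
  *)
    exit 3                                 # 3 = OCF_ERR_UNIMPLEMENTED
    ;;
esac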

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] external/ipmi example configuration

2008-01-04 Thread Thomas Glanzmann
Hello Dominik,

 How can I test the stonith plugin eg. tell heartbeat to shoot someone?

 iptables -I INPUT -j DROP

Okay. That is obvious. Play dead fish in the water. Lucky me that I
don't have a serial heartbeat. Thanks.

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] external/ipmi example configuration

2008-01-04 Thread Thomas Glanzmann
Hello Dejan,
I searched the archive, but I only looked for ipmi in the subject. Now
that you mention it, I searched for external stonith and found an
example.

 See http://linux-ha.org/ExternalStonithPlugins for an example.  You
 can also search the archive of this list for more examples.

I read that page over and over but did not get it. But now I think I get
it. I need one primitive per IPMI device. That was the information
that I missed. So this should do the job, shouldn't it?

<clone id="DoFencing">
  <instance_attributes>
    <attributes>
      <nvpair name="clone_max" value="2"/>
      <nvpair name="clone_node_max" value="1"/>
    </attributes>
  </instance_attributes>

  <primitive id="postgres-01-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
    <operations>
      <op id="postgres-01-fencing-monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
      <op id="postgres-01-fencing-start" name="start" timeout="20s" prereq="nothing"/>
    </operations>

    <instance_attributes>
      <attributes>
        <nvpair id="postgres-01-fencing-hostname" name="hostname" value="postgres-01"/>
        <nvpair id="postgres-01-fencing-ipaddr" name="ipaddr" value="172.18.0.121"/>
        <nvpair id="postgres-01-fencing-userid" name="userid" value="Administrator"/>
        <nvpair id="postgres-01-fencing-passwd" name="passwd" value="whatever"/>
      </attributes>
    </instance_attributes>
  </primitive>

  <primitive id="postgres-02-fencing" class="stonith" type="external/ipmi" provider="heartbeat">
    <operations>
      <op id="postgres-02-fencing-monitor" name="monitor" interval="5s" timeout="20s" prereq="nothing"/>
      <op id="postgres-02-fencing-start" name="start" timeout="20s" prereq="nothing"/>
    </operations>

    <instance_attributes>
      <attributes>
        <nvpair id="postgres-02-fencing-hostname" name="hostname" value="postgres-02"/>
        <nvpair id="postgres-02-fencing-ipaddr" name="ipaddr" value="172.18.0.122"/>
        <nvpair id="postgres-02-fencing-userid" name="userid" value="Administrator"/>
        <nvpair id="postgres-02-fencing-passwd" name="passwd" value="whatever"/>
      </attributes>
    </instance_attributes>
  </primitive>
</clone>

How can I test the stonith plugin, e.g. tell heartbeat to shoot someone?

Thomas
___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

