Re: [Linux-HA] Antw: Re: file system resource becomes inaccesible when any of the node goes down

2015-07-09 Thread Lars Marowsky-Bree
On 2015-07-07T14:15:14, Muhammad Sharfuddin m.sharfud...@nds.com.pk wrote:

 now msgwait timeout is set to 10s and a delay/inaccessibility of 15 seconds
 was observed. If a service(App, DB, file server) is installed and running
 from the ocfs2 file system via the surviving/online node, then
 wouldn't that service get crashed or become offline due to the
 inaccessibility of the file system(event though its ocfs2) while a member
 node goes down ?

You're seeing a trade-off of using OCFS2. The semantics of the file
system require all accessing nodes to be very closely synchronized (that
is not optional), and that requires the access to the fs to be paused
during recovery. (See the CAP theorem.)

The apps don't crash, they are simply blocked. (To them it looks like
slow IO.)

The same is true for DRBD in active/active mode; the block device is
tightly synchronized, and this requires both nodes to be up, or cleanly
reported as down.

 If cluster is configured to run the two independent services, and starts one
 on node1 and ther on node2, while both the service shared the same file
 system, /sharedata(ocfs2),  then in case of a failure of one node, the
 other/online wont be able to
 keep running the particular service because the file system holding the
 binaries/configuration/service is not available for around at least 15
 seconds.
 
 I don't understand the advantage of Ocfs2 file system in such a setup.

If that's your setup, indeed, you're not getting any advantages. OCFS2
makes sense if you have services that indeed need access to the same
file system and directory structure.

If you have two independent services, or even services that are
essentially node local, you're much better off using independent,
separate file system mounts with XFS or extX.



Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
___
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha


Re: [Linux-HA] Antw: Re: Antw: Re: file system resource becomes inaccesible when any of the node goes down

2015-07-09 Thread Lars Marowsky-Bree
On 2015-07-07T12:23:44, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 The advantage depends on the alternatives: If two nodes both want to access 
 the same filesystem, you can use OCFS2, NFS, or CIFS (list not complete). If 
 only one node can access a filesystem, you could try any journaled filesystem 
 (a fsck is needed after a node crash).

A journaled file system does not require a fsck after a crash.

 If you use NFS or CIFS, you shift the problem to another machine.
 
 If you use a local filesystem, you need recovery, mount, and start of your 
 application on a standby node.
 
 With OCFS2 you'll have to wait for a few seconds before your application can 
 continue.

The recovery happens in the background with OCFS2 as well; the fs
replays the failed node's journal in the background. The actual time
saved by avoiding the mount is negligible.



Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Dilip Upmanyu, Graham 
Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list is closing down.
Please subscribe to us...@clusterlabs.org instead.
http://clusterlabs.org/mailman/listinfo/users
___
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha


Re: [Linux-HA] crmsh fails to stop already stopped resource

2015-02-16 Thread Lars Marowsky-Bree
On 2015-02-16T09:20:22, Kristoffer Grönlund kgronl...@suse.com wrote:

 Actually, I decided that it does make sense to return 0 as the error
 code even if the resource to delete doesn't exist, so I pushed a commit
 to change this. The error message is still printed, though.

I'm not sure I agree, for once.

Idempotency is for resource agent operations, not necessarily all
operations everywhere. Especially because crmsh doesn't know whether the
object doesn't exist because it was deleted, or because it was
misspelled.

Compare the Unix-as-little-else rm command; rm -f /tmp/idontexist will
give an error code.

Now a caller of crmsh has to *parse the output* to know whether the
delete command succeeded or not, which is rather non-trivial.

If the caller doesn't care whether the command succeeded or not, it
should be the caller that ignores the error code.

Or if you want to get real fancy, return different exit codes for
referenced object does not exist, or generic syntax error.

  Following fails with the current crmsh (e4b10ee).
  # crm resource stop cl-http-lv
  # crm resource stop cl-http-lv
  ERROR: crm_diff apparently failed to produce the diff (rc=0)
  ERROR: Failed to commit updates to cl-http-lv
  # echo $?
  1

And, yeah, well, this shouldn't happen. Here idempotency applies ;-)



Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Antw: Re: Antw: Re: Antw: Re: SLES11 SP3: warning: crm_find_peer: Node 'h01' and 'h01' share the same cluster nodeid: 739512321

2015-01-30 Thread Lars Marowsky-Bree
On 2015-01-30T14:57:29, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  # grep -i high /etc/corosync/corosync.conf
  clear_node_high_bit:new
  
  Could this cause our problem?
  This is an option that didn't exist prior to SP3.
 With there was no change meant: No administrator did change the file; if a
 package installation changed the file, I'm not aware of it.

Hrm. It's definitely not an option that is meant to be changed
automatically. Because it would break upgrades if it was. If you have
logs that cover the upgrade process, I'm pretty sure someone would be
willing to investigate.

  And yes, changing this option would cause exactly the issue you've seen.
 
 So it's this phrase from the manual page?:
 
   WARNING:  The  clusters  behavior is undefined if this option
 is
   enabled on only a subset of the cluster (for  example  during 
 a
   rolling upgrade).

Among others, yes. Basically, it changes how the automated nodeid is
computed. Clearly, that shouldn't be done unnecessarily. And also,
pacemaker should cope with changing nodeids; I thought recent (I forgot
when exactly) pacemaker updates fixed this. It can be a bit messy at
times.

 BTW: The manual page does not says that nodeid has to be a positive 32-bit
 2-complement; the page just says it's a 32-bit number...

For corosync, it doesn't matter, this is a limitation actually coming
from the DLM in-kernel. Isn't that nice. ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Antw: Re: Antw: Re: SLES11 SP3: warning: crm_find_peer: Node 'h01' and 'h01' share the same cluster nodeid: 739512321

2015-01-30 Thread Lars Marowsky-Bree
On 2015-01-30T08:23:14, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Two of the three nodes were actually updated from SP1 via SP2 to SP3, and the
 third node was installed with SP3. AFAIR there was no configuration change
 since SP1.

That must be incorrect, because:

  Was the corosync.conf option clear_node_high_bit changed?
 
 # grep -i high /etc/corosync/corosync.conf
 clear_node_high_bit:new
 
 Could this cause our problem?

This is an option that didn't exist prior to SP3.

And yes, changing this option would cause exactly the issue you've seen.

The old default behaviour was buggy, but since we couldn't fix it w/o
this, we introduced new and defaulted new clusters to this; existing
configuration files would remain on their old mode and thus remain
compatible.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: SLES11 SP3: warning: crm_find_peer: Node 'h01' and 'h01' share the same cluster nodeid: 739512321

2015-01-28 Thread Lars Marowsky-Bree
On 2015-01-28T16:21:23, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Kind of answering my own question:
 Node id 84939948 in hex is 051014AC, which is 5.16.20.172 where the IP 
 address actually is 172.20.16.5.
 But I see another node ID of 739512325 (hex 2C141005) which is 44.20.16.5. 
 That seems revered compared to the above, and the 0x2c doesn't fit anywhere.

It does. The highbit was stripped.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Antw: Re: SLES11 SP3: warning: crm_find_peer: Node 'h01' and 'h01' share the same cluster nodeid: 739512321

2015-01-28 Thread Lars Marowsky-Bree
On 2015-01-28T16:44:34, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  address actually is 172.20.16.5.
  But I see another node ID of 739512325 (hex 2C141005) which is 
  44.20.16.5. 
  That seems revered compared to the above, and the 0x2c doesn't fit anywhere.
  
  It does. The highbit was stripped.
 
 So the logic is If the hight-bit is not stripped, the ID is the reversed IP 
 address; if the hight bit is stripped, it's the non-reversed IP address 
 (without the high bit)? Doesn't it cause problems with an IPv4 class A 
 address? Now when is the high-bit stripped?

Hrm. There was a bug (bnc#806634) relating to how the high bit was
stripped, and that was fixed for SP3.

Were these nodes by chance upgraded from SP2 (with not all updates
applied) directly to SP3?

Was the corosync.conf option clear_node_high_bit changed?


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Support for DRDB

2015-01-17 Thread Lars Marowsky-Bree
On 2015-01-16T16:25:15, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) 
external.martin.kon...@de.bosch.com wrote:

 I am glad to hear that SLE HA has no plans to drop support for DRBD.
 
 Unfortunately I currently cannot disclose who is spreading this false 
 information.

Too bad. Do let them know they're quite wrong though ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: multipath sbd stonith device recommended configuration

2015-01-16 Thread Lars Marowsky-Bree
On 2015-01-16T08:11:48, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Hi!
 
 MHO: The correct time to wait is in an interval bounded by these two values:
 1: An I/O delay that may occur during normal operation that is never allowed 
 to trigger fencing
 2: The maximum value to are willing to accept to wait for fencing to occur
 
 Many people thing making 1 close to zero and 2 as small as possible is the 
 best solution.
 
 But imagine one of your SBD disks has some read problem, and the operation 
 has be be retried a few times. Or think about online upgrading your disk 
 firmware, etc.: Usually I/Os are stopped for a short time (typically less 
 than one minute).

Newer versions of SBD are less affected by this (and by newer, I mean
about 2 years ago). sbd uses async IO and every IO request is timed
out individually; so IO no longer gets stuck. As long as one read gets
through within the watchdog period, you're going to be OK.

In addition to that, I'd strongly recommend enabling the pacemaker
integration, which (since it was new functionality) couldn't be
auto-enabled on SLE HA 11, but is standard on SLE HA 12. On SLE HA 11,
it needs to be enabled using the -P switch.

Then you can enjoy shorter timeouts for SBD and thus lower fail-over
latencies even on a single MPIO device.

 Unfortunately the SBD syntax is a real mess, and there is not manual page 
 (AFAIK) for SBD.

... because man sbd isn't obvious enough, I guess. ;-)

 YOu can change the SBD parameters (on disk) online, but to be
 effective, the SBD daemon has to be restarted.

Yes. Online change of parameters is not supported. You specify them at
create time, not all of them can be overridden by commandline arguments.
You should not need to tune them in newer versions.


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Antw: multipath sbd stonith device recommended configuration

2015-01-16 Thread Lars Marowsky-Bree
On 2015-01-16T12:22:57, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  Unfortunately the SBD syntax is a real mess, and there is not manual page 
  (AFAIK) for SBD.
  ... because man sbd isn't obvious enough, I guess. ;-)
 OK, I haven't re-checked recently: You added one!

Yes, we 'recently' added one - June 2012 ... ;-)



-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Support for DRDB

2015-01-16 Thread Lars Marowsky-Bree
On 2015-01-16T11:56:04, EXTERNAL Konold Martin (erfrakon, RtP2/TEF72) 
external.martin.kon...@de.bosch.com wrote:

 I have been told that support for DRBD is supposed to be phased out from both 
 SLES and RHEL in the near future.

This is massively incorrect for SLE HA. (drbd is part of the HA add-on,
not SLES.) We have absolutely no such plans, and will continue to
support drbd as part of our offerings.

Where did you hear that?


Regards,
Lars

-- 
Architect Storage/HA
SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Jennifer Guild, Dilip 
Upmanyu, Graham Norton, HRB 21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SLES11 SP3 compatibility with HP Data Protector 7' Automatic Desaster Recovery Module

2014-12-04 Thread Lars Marowsky-Bree
On 2014-12-04T08:12:28, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Of course HP's software isn't quite flexible here, but maybe a symlink from 
 the old location to the new one wouldn't be bad (for the lifetime of SLES11, 
 maybe)...

A symlink might not work, depending on what kind of information the tool
wishes to retrieve.

Also, reading the on-disk file directly is not recommended; it should
always be queried via the cibadmin/crm tools.

 Opinions? Any HP guy listening?

Alas, I'm afraid you need to raise a support incident - with HP.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] Ordering of clones; does it work?

2014-11-27 Thread Lars Marowsky-Bree
On 2014-11-27T10:10:47, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Hi!
 
 I had thought ordrring of clones would work, but it looks like it does not in 
 current SLES11 SP3 (1.1.11-3ca8c3b):
 I have rules like:
 order ord_DLM_O2CB inf: cln_DLM cln_O2CB
 order ord_DLM_cLVMd inf: cln_DLM cln_cLVMd
 colocation col_O2CB_DLM inf: cln_O2CB cln_DLM
 colocation col_cLVMd_DLM inf: cln_cLVMd cln_DLM
 
 I thought the instances on one node would start DLM first, then in parallel 
 (optimally) O2CB and cLVMd.

Correct.

 When adding a new node to the cluster, all three instances were started at 
 the same time:
 Nov 25 12:37:18 h05 pengine[15681]:   notice: LogActions: Start  prm_cLVMd:2  
 (h10)
 Nov 25 12:37:18 h05 pengine[15681]:   notice: LogActions: Start  
 prm_LVM_CFS_VMs:2(h10)
 Nov 25 12:37:18 h05 pengine[15681]:   notice: LogActions: Start  prm_O2CB:2   
 (h10)
 Nov 25 12:37:18 h05 pengine[15681]:   notice: LogActions: Start  
 prm_CFS_VMs_fs:2 (h10)

This is the PE *scheduling* the resources to be started. It's not the
actual ordering. It's just telling you that the resources will be
started during this transition.

You need to check either the transition graph (crm configure simulate),
or look at the actual sequence of commands the LRM executes.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] [ha-wg] [Pacemaker] [Cluster-devel] [RFC] Organizing HA Summit 2015

2014-11-26 Thread Lars Marowsky-Bree
On 2014-11-25T16:46:01, David Vossel dvos...@redhat.com wrote:

Okay, okay, apparently we have got enough topics to discuss. I'll
grumble a bit more about Brno, but let's get the organisation of that
thing on track ... Sigh. Always so much work!

I'm assuming arrival on the 3rd and departure on the 6th would be the
plan?

  Personally I'm interested in talking about scaling - with pacemaker-remoted
  and/or a new messaging/membership layer.
 If we're going to talk about scaling, we should throw in our new docker 
 support
 in the same discussion. Docker lends itself well to the pet vs cattle 
 analogy.
 I see management of docker with pacemaker making quite a bit of sense now 
 that we
 have the ability to scale into the cattle territory.

While we're on that, I'd like to throw in a heretic thought and suggest
that one might want to look at etcd and fleetd.

  Other design-y topics:
  - SBD

Point taken. I have actually not forgotten this Andrew, and am reading
your development. I probably just need to pull the code over ...

  - degraded mode
  - improved notifications
  - containerisation of services (cgroups, docker, virt)
  - resource-agents (upstream releases, handling of pull requests, testing)
 
 Yep, We definitely need to talk about the resource-agents.

Agreed.

  User-facing topics could include recent features (ie. pacemaker-remoted,
  crm_resource --restart) and common deployment scenarios (eg. NFS) that
  people get wrong.
 Adding to the list, it would be a good idea to talk about Deployment
 integration testing, what's going on with the phd project and why it's
 important regardless if you're interested in what the project functionally
 does.

OK. So QA is within scope as well. It seems the agenda will fill up
quite nicely.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-25 Thread Lars Marowsky-Bree
On 2014-11-24T16:16:05, Fabio M. Di Nitto fdini...@redhat.com wrote:

  Yeah, well, devconf.cz is not such an interesting event for those who do
  not wear the fedora ;-)
 That would be the perfect opportunity for you to convert users to Suse ;)

  I´d prefer, at least for this round, to keep dates/location and explore
  the option to allow people to join remotely. Afterall there are tons of
  tools between google hangouts and others that would allow that.
  That is, in my experience, the absolute worst. It creates second class
  participants and is a PITA for everyone.
 I agree, it is still a way for people to join in tho.

I personally disagree. In my experience, one either does a face-to-face
meeting, or a virtual one that puts everyone on the same footing.
Mixing both works really badly unless the team already knows each
other.

  I know that an in-person meeting is useful, but we have a large team in
  Beijing, the US, Tasmania (OK, one crazy guy), various countries in
  Europe etc.
 Yes same here. No difference.. we have one crazy guy in Australia..

Yeah, but you're already bringing him for your personal conference.
That's a bit different. ;-)

OK, let's switch tracks a bit. What *topics* do we actually have? Can we
fill two days? Where would we want to collect them?


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [ha-wg] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Lars Marowsky-Bree
On 2014-11-11T09:17:56, Fabio M. Di Nitto fdini...@redhat.com wrote:

Hey,

I know I'm a bit late to the game, but: I'd be happy to meet, yet Brno
is not all that easy to reach. There don't appear to be regular flights
to BRQ, and it's also quite far by train.

Am I missing something obvious regarding public transport?

If everyone is travelling to devconf through Prague, maybe that'd be a
better place to meet? For those of us not going to devconf, that'd be a
quite shorter ride.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [ha-wg] [ha-wg-technical] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Lars Marowsky-Bree
On 2014-11-24T06:59:39, Digimer li...@alteeve.ca wrote:

 The LINBIT folks suggested to land in Vienna and then it's two hours by
 road, but I've not looked too closely at it just yet.

I'd be happy to meet in Vienna. I'm not keen on first flying to VIE and
then spending 2+h on the road/bus.

(Vienna is actually a lot better for me than even Prague; direct
flights ;-)

 Once we get confirmation that the meeting is on, perhaps a pre-meeting
 in a convenient city for $drinks before travel would be nice. There
 must be a train available, too... I mean, in the west, we hear always
 about Europe's excellent public transit... ;)

Well, yes, there is a train, but from Berlin, it's almost 8 hours.

If everyone is routing through either VIE or PRG, it'd just be a lot
saner to have this meeting there; the RHT folks can always travel on.

Frankly, Brno is reasonably inconvenient for international travellers.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Lars Marowsky-Bree
On 2014-09-08T12:30:23, Fabio M. Di Nitto fdini...@redhat.com wrote:

Folks, Fabio,

thanks for organizing this and getting the ball rolling. And again sorry
for being late to said game; I was busy elsewhere.

However, it seems that the idea for such a HA Summit in Brno/Feb 2015
hasn't exactly fallen on fertile grounds, even with the suggested
user/client day. (Or if there was a lot of feedback, it wasn't
public.)

I wonder why that is, and if/how we can make this more attractive?

Frankly, as might have been obvious ;-), for me the venue is an issue.
It's not easy to reach, and I'm theoretically fairly close in Germany
already.

I wonder if we could increase participation with a virtual meeting (on
either those dates or another), similar to what the Ceph Developer
Summit does?

Those appear really productive and make it possible for a wide range of
interested parties from all over the world to attend, regardless of
travel times, or even just attend select sessions (that would otherwise
make it hard to justify travel expenses  time off).


Alternatively, would a relocation to a more connected venue help, such
as Vienna xor Prague?


I'd love to get some more feedback from the community.

As Fabio put it, yes, I *can* suck it up and go to Brno if that's where
everyone goes to play ;-), but I'd also prefer to have a broader
participation.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [ha-wg] [RFC] Organizing HA Summit 2015

2014-11-24 Thread Lars Marowsky-Bree
On 2014-11-24T15:54:33, Fabio M. Di Nitto fdini...@redhat.com wrote:

 dates and location were chosen to piggy-back with devconf.cz and allow
 people to travel for more than just HA Summit.

Yeah, well, devconf.cz is not such an interesting event for those who do
not wear the fedora ;-)

 I´d prefer, at least for this round, to keep dates/location and explore
 the option to allow people to join remotely. Afterall there are tons of
 tools between google hangouts and others that would allow that.

That is, in my experience, the absolute worst. It creates second class
participants and is a PITA for everyone.

I know that an in-person meeting is useful, but we have a large team in
Beijing, the US, Tasmania (OK, one crazy guy), various countries in
Europe etc.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 21284 
(AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] RFC: pidfile handling; current worst case: stop failure and node level fencing

2014-10-24 Thread Lars Marowsky-Bree
On 2014-10-23T20:36:38, Lars Ellenberg lars.ellenb...@linbit.com wrote:

 If we want to require presence of start-stop-daemon,
 we could make all this somebody elses problem.
 I need find some time to browse through the code
 to see if it can be improved further.
 But in any case, using (a tool like) start-stop-daemon consistently
 throughout all RAs would improve the situation already.
 
 Do we want to do that?
 Dejan? David? Anyone?

I'm showing my age, but Linux FailSafe had such a tool as well. ;-) So
that might make sense.

Though in Linux nowadays, I wonder if one might not directly want to add
container support to the LRM, or directly use systemd. With a container,
all processes that the RA started would be easily tracked.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] Q: ocf-tester

2014-09-09 Thread Lars Marowsky-Bree
On 2014-09-09T16:03:04, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I modified the ping RA to meet my needs, and then I used ocf-tester to check 
 it with the settings desired. I'm wondering about the output; shoudln't 
 ocf-tester query the metadata _before_ trying to use the methods, i.e.: Don't 
 use methods the RA doesn't announce:

It's checking if it's getting the proper exit code for unsupported
actions. That's expected.

 This just gives an ugly output ;-) Without -v the output loks better:

Yes, so don't use -v if you don't want verbose output ;-)

 When I try to use -L (worked on an older release), I get:
 --
 Using lrmd/lrmadmin for all tests
 could not start lrmd
 --
 What could be wrong?  (pacemaker-1.1.10-0.15.25 of SLES11 SP3)

Newer versions of pacemaker/cluster-glue no longer have a separate LRM,
hence this option in ocf-tester is no longer working there. This is
expected.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] getting proper sources

2014-06-07 Thread Lars Marowsky-Bree
On 2014-06-07T16:13:05, Dimitri Maziuk dmaz...@bmrb.wisc.edu wrote:

  So I'd appreciate it if you'd not make those claims; I admit to feeling
  slighted.
 The claim that prompted this was that the level of support a centos user
 gets is for pacemaker: 50% chance that the Lars over there will ask if
 he's a paying SuSe customer;

Why on earth would I ask a CentOS user if they're a SUSE customer?

 for heartbeat: 100% chance that Digimer will tell him to install
 pacemaker.

... which is smart, because unless they are paying someone for support
;-), the community likely won't be helping them much, hence they should
upgrade.

EOD. This is pointless.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] getting proper sources

2014-06-06 Thread Lars Marowsky-Bree
On 2014-05-31T11:15:20, Dmitri Maziuk dmaz...@bmrb.wisc.edu wrote:

 Is there a reason you keep spouting nonsense?
 Yes: I have a memory and it remembers. For example, this:
 http://www.gossamer-threads.com/lists/linuxha/users/81573?do=post_view_threaded#81573
 I don't remember that being an isolated incident either.

Uhm. I've just now stumbled across this wonderful discussion, and I was
torn a bit before responding.

That thread *was* an isolated incident. See, the whole business model
for SLES (and RHEL, for that matter) is that customers pay for support
and maintenance. There's several pricing levels from maintenance
updates only to your personal engineer on-call 24/7. If you decide to
run SLES, you don't get to report a bug against a random openSUSE
version (for free) and expect it to be fixed in SLES (for free). That's
like expecting bugs reported against Fedora XY to be fixed in RHEL N.
Etc.

The whole *point* of the enterprise distributions is that they come with
support and certifications (the two usually being tied together).


And yes, it is true: occasionally I have asked people whom I know to be
SUSE customers to report bugs via the official channels. That
streamlines the process, allows us to get fixes to them officially too
(and in a way that doesn't leave their poor support engineer wondering
what version of software their customer is running and where they got
it), and also happens to allow me to demonstrate to my boss I'm doing
actual work instead of just lounging in community mailing lists ;-) To
the outside world, this could probably look like what you describe, but
the truth is a bit different.

Part of the reason for this is that it is unrealistic to expect the free
community to support the often patched versions of code that end up on
the enterprise distributions. (Try reporting a bug in the RHEL/SLES
kernel on LKML and see what response arrives ...) We take
responsibility for that - because otherwise, we annoy upstream. And bugs
specific to the Enterprise releases tend to lack meaning for the
community. If the issue is reproduced on upstream-latest, we'll also
work with upstream on that.

So I'd appreciate it if you'd not make those claims; I admit to feeling
slighted.

If you're running openSUSE Factory with the latest builds of everything
HA from network:ha-clustering:Factory on openSUSE (something I highly
recommend! Latest! Greatest! Bleeding edge! And if anything isn't,
branch + submitrequest your updates! Maintainers and contributors
accepted!), and hit an issue, we'll all happily discuss that right here
on the mailing lists or in our open bugzillas ;-)


Have a great weekend,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Multiple IP address on ocf:heartbeat:IPaddr2

2014-06-05 Thread Lars Marowsky-Bree
On 2014-06-05T07:41:18, Teerapatr Kittiratanachai maillist...@gmail.com wrote:

 In my /usr/lib/ocf/resource.d/heartbeat/ directory doesn't has the `IPv6addr`
 agent.
 But I found that the `IPaddr2` agent also support IPv6, from the source
 code in GitHub (IPaddr2
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2,
 Line 129
 https://github.com/ClusterLabs/resource-agents/blob/master/heartbeat/IPaddr2#L129
 ).
 So I think that the `IPv6addr` agent doesn't be needed, but I still cannot
 use one resource for both IPv4 and IPv6 addresses.
 Anyway, thank you very much for you help.

No, you cannot.

You need to configure two resources for this. Is there any problem with
that?

You can use the same resource agent multiple times.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat Supported Version

2014-06-02 Thread Lars Marowsky-Bree
On 2014-06-02T20:37:59, Venkata G Thota venkt...@in.ibm.com wrote:

 Hello,
 
 In our project we had the heartbeat cluster with version 
 heartbeat-2.1.4-0.24.9.
 
 Is it the supported version ?
 
 Kindly assist how to get support for heartbeat cluster issues.
 Regards

That looks like a fairly old heartbeat version from SUSE Linux
Enterprise Server 10 SP4.

SLES 10 is out of general support since July 2013, but extended support
(https://www.suse.com/support/lc-faq.html#2) or LTSS is still
available.

Alternatively, the best option you'd have is to upgrade to SLES + HA 11
SP3.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Heartbeat Supported Version

2014-06-02 Thread Lars Marowsky-Bree
On 2014-06-02T12:04:23, Digimer li...@alteeve.ca wrote:

 You should email Linbit (http://linbit.com) as they're the company that
 still supports the heartbeat package.

For completeness, I doubt Linbit will support this version, since 2.1.4
from SLES 10 contains a number of backports from the pacemaker 0.7/1.0
series. While source code is obviously available, I'd not suggest to
inflict this on Linbit ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SBD flipping between Pacemaker: UNHEALTHY and OK

2014-04-25 Thread Lars Marowsky-Bree
On 2014-04-22T14:21:33, Tom Parker tpar...@cbnco.com wrote:

Hi Tom,

 Has anyone seen this?  Do you know what might be causing the flapping?

No, I've never seen this.

 Apr 21 22:03:04 qaxen6 sbd: [12974]: info: Waiting to sign in with
 cluster ...

So it connected fine. This is the process maintaining the pcmk
connection, so the others can be disregarded.

 Apr 21 22:03:06 qaxen6 sbd: [12974]: info: We don't have a DC right now.
 Apr 21 22:03:08 qaxen6 sbd: [12974]: WARN: Node state: UNKNOWN
 Apr 21 22:03:09 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:03:10 qaxen6 sbd: [12974]: WARN: Node state: pending
 Apr 21 22:03:11 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:15:01 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!
 Apr 21 22:16:37 qaxen6 sbd: [12974]: info: Node state: online
 Apr 21 22:25:08 qaxen6 sbd: [12974]: WARN: AIS: Quorum outdated!

Is this all that is happening here? 

Judging from this, there should be an unstable pacemaker cluster to go
with this.

Are there any crmd/corosync etc messages?


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Q: NTP Problem after Xen live migration

2014-04-17 Thread Lars Marowsky-Bree
On 2014-04-17T08:05:43, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I think Xen live migration and correct time has a lot to do with HA; maybe
 not with the product you have in mind, but with the concept in general.

Sure. But Xen and kernel time keeping developers aren't subscribed to
linux-ha, so this isn't the best list to report or discuss such issues.

 For the problem: I'll retry soo with all the current updates being installed.

Thanks!


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Q: NTP Problem after Xen live migration

2014-04-16 Thread Lars Marowsky-Bree
On 2014-04-16T09:18:21, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 As it turn out, the time in the VMs is actually wrong after migration:
 # ntpq -pn
  remote   refid  st t when poll reach   delay   offset  jitter
 ==
  127.127.1.0 .LOCL.  10 l   51   64  3770.0000.000   0.001
 +132.199.176.18  192.53.103.108   2 u   71 1024  3770.362  -77840.   2.573
 +132.199.176.153 192.53.103.104   2 u  386 1024  3770.506  -77838.   2.834
 *172.20.16.1 132.199.176.153  3 u  391 1024  3774.140  -77840.   2.744
 +172.20.16.5 132.199.176.18   3 u  386 1024  3773.767  -77841.   1.409
 
 If I want to continue to use NTP, what are the recommendations?

Probably a kernel or hypervisor bug. If these are SLES systems, please
report (with full versions etc) to NTS, but has little to do with HA.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] pull request for sg_persist new RA ocft

2014-03-21 Thread Lars Marowsky-Bree
On 2014-03-18T02:24:51, Liuhua Wang lw...@suse.com wrote:

Hi Liuhua,

thanks for pushing again!

I've taken some time to provide some code review. Overall, I think it
looks good, mostly cosmetic and codingstyle.

I'd welcome more insight from others on this list; especially those with
maintainer access to github's resource-agents repository ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] How to tell pacemaker to process a new event during a long-running resource operation

2014-03-17 Thread Lars Marowsky-Bree
On 2014-03-14T15:50:18, David Vossel dvos...@redhat.com wrote:

 in-flight operations always have to complete before we can process a new 
 transition.  The only way we can transition earlier is by killing the 
 in-flight process, which results in failure recovery and possibly fencing 
 depending on what operation it is.
 
 There's really nothing that can be done to speed this up except work on 
 lowering the startup time of that resource.

We keep getting similar requests though - some services take a long
time, and during that interval, the cluster is essentially stuck. As the
density of resources in the cluster increases and the number of nodes
goes up, this becomes more of an issue.

It *would* be possible with changes to the TE/PE - assume in-flight
operations will complete as planned, so that any further changes to
in-flight resources would be ordered after them, the ability to accept
actions completing from previous transitions -, but it's also
non-trivial.

Pacemaker 2.0 material? ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: resource-agents: exportfs: Unlocking filesystems on stop by default

2014-03-11 Thread Lars Marowsky-Bree
On 2014-03-11T12:37:39, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I'm wondering: Does unlock mean that all file locks are invalidated? If so,
 I think it's a bad idea, because for a migration of the NFS server the exports
 will be stopped/started, thus loosing all locks. That's not what a HA
 NFS-Server should do. HA NFS-Servers try hard to preserve locks.

No. Specifically, this feature was explicitly designed for HA take-over:
http://people.redhat.com/rpeterso/Patches/NFS/NLM/004.txt



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2 Nodes split brain, distant sites

2014-02-28 Thread Lars Marowsky-Bree
On 2014-02-27T11:05:21, Digimer li...@alteeve.ca wrote:

   So regardless of quorum, fencing is required. It is the only way to
 reliably avoid split-brains. Unfortunately, fencing doesn't work on stretch
 clusters.

For a two node stretch cluster, sbd can also be used reliably as a
fencing mechanism.

It essentially uses the standard iSCSI protocol as a quorum mechanism.
Export one (1MB or so) iSCSI LU from each site to the other, and in the
best case, host one at a 3rd site as tie-breaker. Then run SBD across
these.

booth is striving to address even longer distances, where each site is a
truly separate cluster (e.g., independent corosync/pacemaker setups,
totem not running across the gap).



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2 Nodes split brain, distant sites

2014-02-28 Thread Lars Marowsky-Bree
On 2014-02-28T13:16:33, Digimer li...@alteeve.ca wrote:

 Assuming a SAN in each location (otherwise you have a single point of
 failure), then isn't it still possible to end up with a split-brain if/when
 the WAN link fails?

As I suggested a 3rd tie-breaker site (which, in the case of SBD, can be
any old iSCSI LU somewhere), no, this can't happen. One site would fence
the other.

(The same is true, if a bit more elegantly, for booth.)

 Something (drbd?) is going to be keeping the data in sync between the
 locations. If both assume the other is dead, sure each location's SAN will
 block the other node, but then each location will proceed independently and
 their data will diverge, right?

For a two site setup, perhaps. A manual fail-over would need to make
sure that the other site is really stopped.

Assuming two sites, the currently active site would not be able to
commit new transactions as long as the WAN link is down and the other
site has not been declared dead (through automatic or manual fence
confirmation).

Also, any replication mechanism includes a means to pick a winner and
sync over such divergent changes, should they occur. Which they
shouldn't.

(Asynchronous replication on the other hand always has a RPO  0, and
can always risk losing a few transactions. That is the nature of
disaster recovery. Hopefully, disasters are rare.)



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Why wont ocf:heartbeat:Xen check the current status?

2014-02-22 Thread Lars Marowsky-Bree
On 2014-02-22T12:35:42, ml ml mliebher...@googlemail.com wrote:

 Hello List,
 
 i have a two node Cluster with Debian 7 and this configuation:
 
 node proxy01-example.net
 node proxy02-example.net
 primitive login.example.net ocf:heartbeat:Xen \
 params xmfile=/etc/xen/login.example.net.cfg \
 op monitor interval=30s timeout=600 \
 op start interval=0 timeout=60 \
 op stop interval=0 timeout=40 \
 meta target-role=Started allow-migrate=true is-managed=false
 property $id=cib-bootstrap-options \
 dc-version=1.1.7-ee0730e13d124c3d58f00016c3376a1de5323cff \
 cluster-infrastructure=openais \
 expected-quorum-votes=2 \
 stonith-enabled=false \
 no-quorum-policy=ignore \
 last-lrm-refresh=1393067803
 rsc_defaults $id=rsc-options \
 resource-stickiness=100
 
 The resource login.example.net is already running because this is a
 existing setup and i would like to add the linux-ha stuff.
 
 So lets say login.example.net is already running on proxy01-example.net.
 
 If i now activate this configuration, the Xen Instance
 login.example.netwill be started on the
 proxy02-example.net node.
 
 Now i have then running twice!!
 
 Why does it not check IF the resource is already running somewhere?

It does.

It also obviously isn't working for you.

The RA tries to determine the name of the domain to look for from the
configuration file; perhaps that's not working for you? You can manually
set that for the resource too.

The log messages from the agent are probably enlightening, or try
running it manually with monitor and see why it doesn't pick up. The
answers are always in the logs (or if that fails, in the code ;-).

On the other hand, you don't really care if guests run twice, right?
Otherwise you'd not have disabled stonith ...


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] resources don't migrate on failure of one node (in a two node cluster)

2014-02-22 Thread Lars Marowsky-Bree
On 2014-02-22T13:49:40, JR botem...@gmail.com wrote:

 I've been told by folks on the linux-ha IRC that fencing is my answer
 and I've put in place the null fence client.  I understand that this is
 not what I'd want in production, but for my testing it seems to be the
 correct way to test a cluster.  I've confirmed in the good server's logs
 that it believes it has successfully fenced its partner

Well, as long as you never do this for production, yes, I guess this
should work.

But I doubt anyone has tested if the null stonith agent really works.
Perhaps it is, in fact, too fast and you're hitting a strange race. Or
perhaps something else is going wrong, such as no-quorum-policy not
being set properly. It's impossible to tell without your configuration
or logs, which always hold the answers ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Q: crm configure edit/show regex

2014-02-19 Thread Lars Marowsky-Bree
On 2014-02-19T10:31:45, Andrew Beekhof and...@beekhof.net wrote:

  Unifying this might be difficult, as far as I know pcs doesn't have an
  interactive mode or anything similar to the configure interface of
  crmsh..
 It does have bash completion for the command line.

FWIW, so does the crm shell ;-) (It works just like in the interactive
mode too, probably the fully internal mode is a bit faster, since it
avoids bash stuff.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Lars Marowsky-Bree
On 2014-02-05T12:24:00, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I had a problem where O2CB stop fenced the node that was shut down:
 I had updated the kernel, and then rebooted. As part of shutdown, the cluster 
 stack was stopped. In turn, the O2CB resource was stopped.
 Unfortunately this caused an error like (SLES11 SP3):
 
 ---
 modprobe: FATAL: Could not load /lib/modules/3.0.101-0.8-xen/modules.dep: No 
 such file or directory
 o2cb(prm_O2CB)[19908]: ERROR: Unable to unload module: ocfs2
 ---
 
 This in turn caused a node fence, which ruined the clean reboot.
 
 So why is the RA messing with the kernel module on stop?

Because customers complained about the new module not being picked up if
they upgrade ocfs2-kmp and restarted the cluster stack on a node. It's
incredibly hard to please everyone, alas ...

The right way to update a cluster node is anyway this one:

1. Stop the cluster stack
2. Update/upgrade/reboot as needed
3. Restart the cluster stack

This would avoid this error too. Or keeping multiple kernel versions in
parallel (which also helps if a kernel update no longer boots for some
reason). Removing the running kernel package is usually not a great
idea; I prefer to remove them after having successfully rebooted only,
because you *never* know if you may have to reload a module.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Why does o2cb RA remove module ocfs2?

2014-02-05 Thread Lars Marowsky-Bree
On 2014-02-05T15:06:47, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I guess the kernel update is more common than the just the ocfs2-kmp update

Well, some customers do apply updates in the recommended way, and thus
don't encounter this ;-) In any case, since at this time the cluster
services are already stopped, at least the service impact is minimal.

  This would avoid this error too. Or keeping multiple kernel versions in
  parallel (which also helps if a kernel update no longer boots for some
  reason). Removing the running kernel package is usually not a great
  idea; I prefer to remove them after having successfully rebooted only,
  because you *never* know if you may have to reload a module.
 
 There's another way: (Like HP-UX learned to do it): Defer changes to the
 running kernel until shutdown/reboot.

True. Hence: activate multi-versions for the kernel in
/etc/zypp/zypp.conf and only remove the old kernel after the reboot. I
do that manually, but I do think we even have a script for that
somewhere. I honestly don't remember where though; I like to keep
several kernels around for testing anyway.

I think this is the default going forward, but as always: zypper gained
this ability during the SLE 11 cycle, and we couldn't just change
existing behaviour in a simple update, it has to be manually
activated.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm configure stonith:external/vcenter - parameter VI_CREDSTORE does not exist

2014-01-30 Thread Lars Marowsky-Bree
On 2014-01-30T12:19:27, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  root@vm-nas1:~# crm ra info fencing_vcenter stonith:external/vcenter
  ERROR: stonith:external/vcenter:fencing_vcenter: could not parse meta-data:
 I guess your RA may be LSB (which is kind of obsolete).

Hm? How can a stonith-external fencing agent be LSB? (Not talking
about RAs here, either.)

  crm(live)ra# list stonith
  apcmaster       apcmastersnmp   apcsmart        baytech         bladehpi
  cyclades        drac3           fence_legacy    fence_pcmk      ibmhmc
  ipmilan         meatware        null            nw_rpc100s      rcd_serial
  rps10           ssh             suicide         wti_mpc         wti_nps
  
  the external type is missing, is this correct?

Yes, this should include the external/* types. (It does for me, but
then, I'm on openSUSE / SLE HA.)

  root@vm-nas1:~# stonith -L | grep external
  external/drac5
[...]

So they are packaged, but not picked up by the crm shell. Seems that
there's a packaging error in how agents are queried, but I admit to not
knowing which one - I've not seen this behaviour before.

  on my Debian/Jessie with pacemaker 1.1.7 / cluster-glue 1.0.10 I run into 
  the
  following problem:

It's probably best to update to latest releases - pacemaker-1.1.10 (or
even the 1.1.11 candidate) etc.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm configure stonith:external/vcenter - parameter VI_CREDSTORE does not exist

2014-01-30 Thread Lars Marowsky-Bree
On 2014-01-31T07:59:57, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

   root@vm-nas1:~# crm ra info fencing_vcenter stonith:external/vcenter
   ERROR: stonith:external/vcenter:fencing_vcenter: could not parse
 meta-data:
  I guess your RA may be LSB (which is kind of obsolete).
  
  Hm? How can a stonith-external fencing agent be LSB? (Not talking
  about RAs here, either.)
 
 Sorry, I didn't look carefully. OTOH, why doesn't it provide valid metadata?

Probably a packaging error that the external plugin isn't being properly
found on Debian.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: /usr/sbin/lrmadmin missing from cluster-glue

2014-01-27 Thread Lars Marowsky-Bree
On 2014-01-27T08:59:55, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Talking of node-action-limit: I think I read in the syslog (not the best way
 to document changes) that the migration-limit parameter is obsoleted by
 node-action-limit in the latest SLES. Is that correct?

No.

 crmd[16709]:  warning: cluster_option: Using deprecated name 'migration-limit'
 for cluster option 'node-action-limit'

This message is misleading, sorry :-( Already fixed.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] /usr/sbin/lrmadmin missing from cluster-glue

2014-01-25 Thread Lars Marowsky-Bree
On 2014-01-24T10:52:56, Tom Parker tpar...@cbnco.com wrote:

 Thanks Kristoffer.
 
 How is tuning done for lrm now?

What do you want to tune?

The LRM_MAX_CHILDREN setting is still honored as before (okay: again,
since it was briefly broken in one update ;-). Or you can use the
node-action-limit property in the CIB to achieve the same without
setting environment variables, in case you don't like the automatic
attempt at load dampening that pacemaker deploys.
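For example, in crmsh (the limit value here is just illustrative):

    crm configure property node-action-limit=4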


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: heartbeat failover

2014-01-24 Thread Lars Marowsky-Bree
On 2014-01-24T08:16:03, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 We have a server with a network traffic light on the front. With
 corosync/pacemaker the light is constantly flickering, even if the cluster
 does nothing.
 So I guess it's normal.

Yes. Totem and other components have on-going health checks, so there's
constant background traffic. But nothing major.

 you that traffic will increase if your configuration grows and if the cluster
 actually does something. I don't know if "communicate like mad" was a design
 concept, but on our HP-UX Service Guard cluster there was only the heartbeat
 traffic (configured to be one packet every 7 seconds (for good luck reasons
 ;-)) when the cluster was idle.

HP SG runs an entirely different protocol and is an entirely different
architecture.

 Obviously the cLVM mirroring causes the corosync problems:

I've never, ever observed this during testing and in my continuously
running cLVM RAID setup. Looks like an overload during resync, though.
You need faster NICs ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] crmd (?) becomes unresponsive

2014-01-22 Thread Lars Marowsky-Bree
On 2014-01-22T09:55:10, Thomas Schulte tho...@cupracer.de wrote:

Hi Thomas,

since those are very recent upstream versions, I think you'll have a
better chance to ask directly on the pacemaker mailing list, or directly
report via bugs.clusterlabs.org - at least for providing the
attachments, that's the best option.


Best,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: crmd (?) becomes unresponsive

2014-01-22 Thread Lars Marowsky-Bree
On 2014-01-22T11:18:06, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 We are living in a very distributed world, even when using one Linux
 distribution. Maybe those who know could post periodic reminders which 
 problems
 to post where...

I thought that was what I just did? This is likely a pacemaker problem,
and more pacemaker experts are subscribed to the clusterlabs list than
here. It's also the right bugzilla for the attachments/logs.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Q: fsck.ocfs anyone

2014-01-15 Thread Lars Marowsky-Bree
On 2014-01-15T12:05:22, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

Ulrich,

please either ask this question to support or at least on the ocfs2
mailing list.

We really can't provide enterprise-level support via a generic
mailing list. That is not a sustainable business model. And since SLE
contains code that is not upstream, the upstream mailing lists also
can't really help you.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SLE11 SP3: attrd[13911]: error: plugin_dispatch: Receiving message body failed: (2) Library error: Success (0)

2014-01-15 Thread Lars Marowsky-Bree
On 2014-01-15T08:49:55, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I feel the current cluster stack for SLES11 SP3 has several problems. I've been 
 fighting for a day to get my test cluster up again after having installed the 
 latest updates. I still cannot find out what's going on, but I suspect there 
 are too many bugs in the software (again). For example I just saw these 
 messages:

Please raise your issues with support. We cannot provide support for
enterprise software via upstream mailing lists.

Also, your scenario description doesn't include enough context to debug
what's going on and why.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Q: fsck.ocfs anyone

2014-01-15 Thread Lars Marowsky-Bree
On 2014-01-15T15:05:02, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 My man at Novell knows about the issue, too ;-)

%s/Novell/SUSE/g

 I understand that Novell does not want to read about bugs in their products on
 mailing lists, just as customers don't want to see bugs in the products they are
 using. Talking about them may improve the situation, while just being silent
 doesn't really help either side.

That's not really the issue. The issue is that no one can really help you
here. It's like reporting a bug in SLES' (or RHEL's, for that matter)
kernel on LKML. They'll tell you to take it to your vendor.

And even if they wanted to, you're not providing enough context, not
enough configuration, etc. And OCFS2 is not the main focus on this list,
so all the OCFS2 experts read another list.

And the people that could help you from SUSE won't do this here, because
- as you said - you also raise this via support. (Or maybe you don't,
because sometimes, you say it's not important enough, so we'd have to
double check.) So we use that channel to respond (us being dedicated to
the product's quality, getting paid to do so and all that); leaving the
mailing list either with a bad impression of a hanging discussion or
having to do the work/write-up twice.

The derisive style doesn't inspire a lot of motivation, either.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 3 active nodes in 3 node configuration

2014-01-09 Thread Lars Marowsky-Bree
On 2014-01-09T22:38:17, erkin kabataş ekb...@gmail.com wrote:

 I am using  heartbeat-2.1.2-2.i386.rpm, heartbeat-pils-2.1.2-2.i386.
 rpm, heartbeat-stonith-2.1.2-2.i386.rpm packages on RHEL 5.5.

2.1.2? Seriously, upgrade. You're running code from 2007.

 I have 3 nodes and I only use a cluster IP as a resource. When I start the
 heartbeat service on all nodes, node1 takes the cluster IP. But after about
 2-3 days, all 3 nodes got the cluster IP at the same time.

The answer is probably in the logs. Probably some network failure
somewhere that causes split-brain, and because you don't have
fencing/STONITH configured, resources get activated multiple times.

Honestly, 2.1.2 is so long ago that we can't even remember all the bugs
it had ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Re: [Linux-HA] does heartbeat 3.0.4 use IP aliases under CentOS 6.5?

2014-01-04 Thread Lars Marowsky-Bree
On 2014-01-03T20:56:42, Digimer li...@alteeve.ca wrote:

 causing a lot of reinvention of the wheel. In the last 5~6 years, both teams
 have been working hard to unify under one common open-source HA stack.
 Pacemaker + corosync v2+ is the result of all that hard work. :)

Yes. We now finally have one stack everywhere. Yay!

To keep our spirits up, we just admin it completely differently ;-)



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Problem with updating the cluster stack (SLES11 SP3) (long)

2014-01-02 Thread Lars Marowsky-Bree
On 2014-01-02T10:17:13, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  Are you using the update that was pushed on Friday?  The previous
 As I installed updates on Monday, I guess so ;-)

Hm, we're not yet aware of any new or existing bugs in that update. Of
course, we'll learn differently very soon ...

 Signal 6 looks like a bug I saw in the past. Could it be some fix is 
 missing in the mainline?

Not that we're aware of that.

The shutdown looks good.

  Dec 30 11:39:42 h01 cib[11884]:error: crm_abort: cib_common_callback: 
  Triggered fatal assert at callbacks.c:247 : flags  crm_ipc_client_response

Whoops. Not good. This looks like SR/bugzilla time with a hb_report
attached. Hopefully, you got a coredump from this.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Pacemaker] crmsh: New syntax for location constraints, suggestions / comments

2013-12-13 Thread Lars Marowsky-Bree
On 2013-12-13T10:16:41, Kristoffer Grönlund kgronl...@suse.com wrote:

 Lars (lmb) suggested that we might switch to using the { } - brackets
 around resource sets everywhere for consistency. My only concern with
 that would be that it would be a breaking change to the previous crmsh
 syntax. Maybe that is okay when going from 1.x to 2.0, but it also
 makes me a bit nervous. :)

Yes, but consistency and syntax cleanups are important for 2.0.

Also, then we may finally be able to change the order for collocation
constraints as well ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] [Pacemaker] crmsh: New syntax for location constraints, suggestions / comments

2013-12-13 Thread Lars Marowsky-Bree
On 2013-12-13T13:51:27, Andrey Groshev gre...@yandex.ru wrote:

 Just thought that what I was missing in location was something like: node=any :)

Can you describe what this is supposed to achieve?

"any" is the default for symmetric clusters anyway.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA

2013-12-04 Thread Lars Marowsky-Bree
On 2013-12-04T10:25:58, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  You thought it was working, but in fact it wasn't. ;-)
 working meaning the resource started.
 not working meaning the resource does not start
 
 You see I have minimal requirements ;-)

I'm sorry; we couldn't possibly test all misconfigurations. So this
slipped through; we didn't expect someone to set that for a
non-clustered VG previously.

  You could argue that it never should have worked. Anyway: If you want
  to activate a VG on exactly one node you should not need cLVM; only if
  you mean to activate the VG on multiple nodes (as for a cluster file
  system)...
  
  You don't need cLVM to activate a VG on exactly one node. Correct. And
  you don't. The cluster stack will never activate a resource twice.
 
 Occasionally two safety lines are better than one. We HAD filesystem
 corruptions due to the cluster doing things it shouldn't do.

And that's perfectly fine. All you need to do to activate this is
vgchange -c y on the specific volume group, and the exclusive=true
flag will work just fine.
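Roughly (VG and resource names made up here):

    vgchange -c y vg_data       # mark the VG as clustered

and then in the CIB:

    primitive prm_vg_data ocf:heartbeat:LVM \
        params volgrpname=vg_data exclusive=true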

  If you don't want that to happen, exclusive=true is not what you want to
  set.
 That makes sense, but what I don't like is that I have to mess with local
 lvm.conf files...

You don't. Just drop exclusive=true, or set the clustered flag on the
VG.

You only have to change anything in the lvm.conf if you want to use tags
for exclusivity protection (I defer to the LVM RA help for how to use
that, I've never tried it).


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA

2013-12-03 Thread Lars Marowsky-Bree
On 2013-12-02T09:22:10, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  No!
  
  Then it can't work. Exclusive activation only works for clustered volume
  groups, since it uses the DLM to protect against the VG being activated
  more than once in the cluster.
 Hi!
 
 Try it with resource-agents-3.9.4-0.26.84: it works; with 
 resource-agents-3.9.5-0.6.26.11 it doesn't work ;-)

You thought it was working, but in fact it wasn't. ;-)

Or at least, not as you expected.

 You could argue that it never should have worked. Anyway: If you want
 to activate a VG on exactly one node you should not need cLVM; only if
 you mean to activate the VG on multiple nodes (as for a cluster file
 system)...

You don't need cLVM to activate a VG on exactly one node. Correct. And
you don't. The cluster stack will never activate a resource twice.

You need cLVM if you want LVM2 to enforce that at the LVM2 level -
because it does this by getting a lock on the VG/LV, since otherwise
LVM2 has no way of knowing if the VG/LV is currently active somewhere
else. And this is what exclusive=true turns on.

If you don't want that to happen, exclusive=true is not what you want to
set.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] SLES11 SP2 HAE: problematic change for LVM RA

2013-11-29 Thread Lars Marowsky-Bree
On 2013-11-29T12:05:28, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Hi!
 
 A short notice: We had a problem after updating the resource agents in SLES11 
 SP2 HAE: A LVM VG would not start after updating the RAs. The primitive had 
 exclusive=true for years, but the current RA requires a change in 
 /etc/lvm/lvm.conf, it seems. I didn't have time to investigate, so I just did 
 s/true/false/...

Was that a clustered volume?

It's really, really hard to do anything with reports like 'a problem'
and 'would not start'.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: cib: [12226]: WARN: cib_diff_notify: Update (client: crmd, call:7137): -1.-1.-1 - 0.620.0 (The action/feature is not supported)

2013-11-29 Thread Lars Marowsky-Bree
On 2013-11-26T12:09:41, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I saw that I don't have an SBD device any more (it's stopped). Unfortunately I
 could not start it (crm resource start prm_stonith_sbd).
 I guess it's due to the fact that the cluster won't start resources until the
 UNCLEAN node has been fenced. The dog bites its tail, it seems...

No, this is a harmless visibility change. Pacemaker no longer needs the
device resource to be started before it can fence - stonithd now
directly reads the configuration from the CIB.

The device resource only signifies where the monitor ops will be run
from, but doesn't really impact the fencing. (Though you can still
disable a fence device by setting target-role=Stopped)
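E.g., with crmsh:

    crm resource stop prm_stonith_sbd     # sets target-role=Stopped
    crm resource start prm_stonith_sbd    # re-enables it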


 The cluster is refusing to work:
 cib: [12243]: info: cib_process_diff: Diff 0.620.155 -> 0.621.1 not applied to
 0.617.0: current epoch is less than required

This should trigger a full resync of the CIB. It seems you diverged
during a split-brain scenario. That it can't apply a diff is normal in
this case.

 Unfortunately, and despite the fact that o2 was shot, the cluster got a
 stonith timeout and retried the stonith!

This is a very detailed analysis, but it doesn't share any of the facts
(such as the configured timeouts).

 Could this (on the DC) be the reason?
 o4 stonith-ng[17787]:error: crm_abort: call_remote_stonith: Triggered
 assert at remote.c:973 : op-state  st_done
 stonith-ng[17787]:   notice: remote_op_timeout: Action reboot
 (97a0476a-7f1d-4986-ba68-0f0d88aeb764) for o2 (crmd.17791) timed out

Yeah, that looks annoying. As always, the best way to actually get
support would be to raise a support call.

(Mailing list activity does take a backseat to customer calls.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA

2013-11-29 Thread Lars Marowsky-Bree
On 2013-11-29T13:46:17, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  I just did s/true/false/...
  
  Was that a clustered volume?
  Clustered  exclusive=true ??
 
 No!

Then it can't work. Exclusive activation only works for clustered volume
groups, since it uses the DLM to protect against the VG being activated
more than once in the cluster.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: SLES11 SP2 HAE: problematic change for LVM RA

2013-11-29 Thread Lars Marowsky-Bree
On 2013-11-29T13:48:33, Lars Marowsky-Bree l...@suse.com wrote:

   Was that a clustered volume?
   Clustered  exclusive=true ??
  
  No!
 Then it can't work. Exclusive activation only works for clustered volume
 groups, since it uses the DLM to protect against the VG being activated
 more than once in the cluster.

Ah, that could be the actual reason.

SP3 has a new tagged mode for non-clustered volumes. If it detects
that you want to use a VG in exclusive mode, but it isn't clustered,
then it'll try to fall back to that. That appears to not be working for
you properly.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: cib: [12226]: WARN: cib_diff_notify: Update (client: crmd, call:7137): -1.-1.-1 - 0.620.0 (The action/feature is not supported)

2013-11-26 Thread Lars Marowsky-Bree
On 2013-11-26T09:32:50, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 Another thing I've noticed: One of our nodes has defective hardware and is
 down. It was OK all the time with SLES11 SP2, but SP3 now tried to fence the
 node and got a fencing timeout:

Hmmm.

 Isn't the logic that after a fencing operation timeout the node is considered
 to be OFFLINE?

No. A timeout is not a successful fence. The problem is why you're
getting the timeout in the first place now.

 My node currently has the state UNCLEAN (offline).
 
 How do I make an offline node clean? ;-)

You can run stonith_admin and manually ack the fence, that should work.
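Something along these lines, but only once you are certain the node
really is down:

    stonith_admin --confirm o2    # tell the cluster node o2 is safely down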

  What doesn't work is making changes, or at any point in time having only
  the new version of the cluster up and then trying to rejoin an old one.
 Do you consider resource cleanups (crm_resource -C ...) as a change? These are
 essential in keeping the cluster happy, especially if you missed some ordering
 constraints.

No, as long as an old version is still around, that will be the DC and
the internal upgrade shouldn't happen.

I meant changes to the configuration that actually use new features. And
as soon as no more nodes running the old version are online, it'll
convert upwards too ...

But yes, we're already working on a new maintenance update for SP3 too.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] cib: [12226]: WARN: cib_diff_notify: Update (client: crmd, call:7137): -1.-1.-1 - 0.620.0 (The action/feature is not supported)

2013-11-25 Thread Lars Marowsky-Bree
On 2013-11-25T17:48:25, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

Hi Ulrich,

 Probably reason:
 cib: [12226]: ERROR: cib_perform_op: Discarding update with feature set 
 '3.0.7' greater than our own '3.0.6'
 
 Is it required to update the whole cluster at once?

It shouldn't be, and we tested that for sure. Rolling upgrades should be
possible.

What doesn't work is making changes, or at any point in time having only
the new version of the cluster up and then trying to rejoin an old one.

It's hard to say what happened without knowing more details about your
update procedure.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Xen XL Resource Agent

2013-11-18 Thread Lars Marowsky-Bree
On 2013-11-15T09:05:53, Tom Parker tpar...@cbnco.com wrote:

 The XL tools are much faster and lighter weight.   I am not sure if they
 report proper codes (I will have to test) but the XM stack has been
 deprecated so at some point I assume it will go away completely.

The Xen RA already supports xen-list and xen-destroy in addition to
the xm tools. Patches to additionally support xl are welcome.

(Auto-detect what is available, and then choose xl -> xen-* -> xm.)

We can't yet drop xm, since not all environments have xl yet.
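A rough sketch of that detection order (not the actual RA code, just
the idea):

    if command -v xl >/dev/null 2>&1; then
        xen_cmd=xl
    elif command -v xen-list >/dev/null 2>&1; then
        xen_cmd=xen-list          # plus xen-destroy for stop
    else
        xen_cmd=xm
    fi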


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] ocf:heartbeat/ra's

2013-10-31 Thread Lars Marowsky-Bree
On 2013-10-31T10:54:15, Chuck Smith cgasm...@comcast.net wrote:

 I have been debugging ocf:heartbeat:anything, can someone point me to the
 definitive standards for ra handshake, as it appears there are several
 ported "legacy" methods that are inconsistent. Also, if you can point me to
 the top of tree / owner for the RA's, I'll be sure to contribute whatever
 changes I make that resolve the problems I'm seeing. (using dhcpd/named in
 foreground)

Hi Chuck,
 
I'm entirely unsure what you mean by "ra handshake". Can you be
more specific?

The repository is: https://github.com/ClusterLabs/resource-agents/

Includes documentation, but that's also available here:
http://www.linux-ha.org/doc/dev-guides/ra-dev-guide.html


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] RH and gfs2 and Pacemaker

2013-10-15 Thread Lars Marowsky-Bree
On 2013-10-15T14:15:50, Moullé Alain alain.mou...@bull.net wrote:

 in fact, I would like to know if someone has configured gfs2 under Pacemaker
 with the dlm-controld and gfs-controld from cman-3.0.12 rpm (so without any
 more the dlm-controld.pcml and gfs-controld.pcml)  ?
 
 And if it works fine with Pacemaker ?

It's not quite what you're asking, but on openSUSE Factory, we have GFS2
working with just dlm_controld+pacemaker+corosync 2.3.x, no more *.pcmk
etc. 

Works fine, it seems. I'd hope the same is true on the distribution
where GFS2 originated from ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] RH and gfs2 and Pacemaker

2013-10-15 Thread Lars Marowsky-Bree
On 2013-10-15T16:25:37, Moullé Alain alain.mou...@bull.net wrote:

 Hi Lars,
 thanks a lot for the information.
 I'll try, but the documentation asks for the gfs2-cluster rpm installation, and
 for now I can't find this rpm on RHEL6.4, and don't know
 if it is always required ... but that is not on your side ;-)

gfs2-cluster? I thought all that was needed was gfs2-utils.

Without the gfs2-controld anymore, I don't think you need that package.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: General question about heartbeat tokens and node overloaded.

2013-10-02 Thread Lars Marowsky-Bree
On 2013-10-02T09:36:14, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 In general I'm afraid you cannot handle this situation in a perfect way:
 
 You have two types of problems:
 1) A node, resource, or monitor is hanging, but a long timeout prevents
 recognizing this in time
 2) A node, resource, or monitor is performing slower than usual, but a short
 timeout causes the cluster to think there is a problem with the
 node/resource/monitor

Yes, or to summarize, timeouts suck for failure detection, but for many
cases, we don't have anything better. Digging out my age-old post:
http://advogato.org/person/lmb/diary/108.html

A massively overloaded system is indistinguishable from a failing or
hung one. On the plus side, if a system is *so* overloaded that
corosync isn't being scheduled and its rather limited network traffic
presents a problem, it is likely so FUBAR'ed that fencing it doesn't
make things worse. So the misdiagnosis isn't necessarily a problem.

 BTW: We experienced hanging I/O when one of our SAN devices had a
 problem, but the others did not. Still the SLES11 SP2 kernel saw
 stalled I/Os for obviously unaffected devices. The problem is being
 investigated...

FC can be weird like that if it is routed through the same HBA or
switch. It's not always a kernel problem; the fabric isn't trivial
either. Good luck with finding the root cause :-/


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: General question about heartbeat tokens and node overloaded.

2013-10-02 Thread Lars Marowsky-Bree
On 2013-10-02T13:40:16, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 There is one notable exception: If you have shared storage (SAN, NAS, NFS), 
 the cause of the slowness may be external to the systems being monitored, 
 thus fencing those will not improve the situation, most likely.

True. Alas, the cluster stack doesn't magically know that, but must be
told - to either allow for long enough timeouts for services on such
media (because even then, the slowness may be external, or caused
internally), or to not monitor them.

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Xen RA and rebooting

2013-10-01 Thread Lars Marowsky-Bree
On 2013-10-01T00:53:15, Tom Parker tpar...@cbnco.com wrote:

 Thanks for paying attention to this issue (not really a bug) as I am
 sure I am not the only one with this issue.  For now I have set all my
 VMs to destroy so that the cluster is the only thing managing them but
 this is not super clean as I get failures in my logs that are not really
 failures.

It is very much a severe bug.

The Xen RA has gained a workaround for this now, but we're also pushing
the Xen team (where the real problem is) to investigate and fix.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2 node clustes with seperate quorum server

2013-09-25 Thread Lars Marowsky-Bree
On 2013-09-24T20:55:40, AvatarSmith cgasm...@comcast.net wrote:

 I'm having a bit of an issue under centos 6.4 x64. I have two duplcate
 hardware systems (raid arrays, 10G nics' etc) configured identically and
 drbd replication is working fine in the cluster between the two. When I
 started doing fail over testing, via power/communcations interruptions I
 found that I could not reliably shift the resources from cluster1 to
 cluster2 even though they are identical in every aspect. I AM able (by
 starting pacemaker first on one or the other) to get the cluster up on
 either node. I was told that this is a problem for a non-stonith 2 node
 cluster and to add a third server to provide the quorum vote to tell the
 survivor to host the cluster resources.

The best bet, in my humble opinion, in this case is truly to set up a
quorum node - but not as a full member of the cluster.

This is not in CentOS, but if you use sbd as a fencing mechanism and
enable the pacemaker integration (see
https://github.com/l-mb/sbd/blob/master/man/sbd.8.pod), that allows you
to use an iSCSI target as a quorum node.

(Think of it as a quorum node implemented on top of a standard storage
protocol.)

That means you can either install a small iSCSI server somewhere (easily
done under Linux, export a 1MB LUN from a VM or something), or utilize
existing storage servers to provide that.
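A rough sketch of that setup (the device path is a placeholder; see the
sbd man page linked above for the details):

    # initialize the small LUN as an sbd device, then verify it
    sbd -d /dev/disk/by-id/<your-lun> create
    sbd -d /dev/disk/by-id/<your-lun> list

    # point sbd at the device and enable the pacemaker integration
    # (e.g. SBD_DEVICE plus "-P" in SBD_OPTS in /etc/sysconfig/sbd), then:
    crm configure primitive stonith-sbd stonith:external/sbd
    crm configure property stonith-enabled=true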

I don't currently have a build for CentOS, but I'd welcome patches to
the specfile to make it work there ;-)

 hardware) I still get messages to the effect that the p_drbd_r0 monitor won't
 load because it's not installed... well duh, it's not installed, but it's not
 supposed to be running on node3 anyway. Why is it trying to monitor on a node it's
 not installed or permitted to run on?

monitor_0 is the initial probe that makes sure that the service is
really not running, even on nodes where the configuration forbids it to
run. This is expected
behaviour.


 flags can I toggle to help point out the way or does PCS / crm_x provide
 a better interface for configuring/debugging this ? 

This behaviour has nothing to do with how you configure pacemaker, but is
core to pacemaker itself. Though I think that later versions may have
learned (or will learn) to hide ERR_INSTALLED results in crm_mon if
there's a -inf location rule, the backend remains the same.

 Lastly, during my failover and configuration testing, I found the
 only surefire way to apply a new cluster config is to cibadmin -f -E and
 cut and paste in a new one, followed by a reboot... what a pain. You can
 sometimes get away with restarting pacemaker on all nodes, bringing up your
 intended primary first and then the others later

This clearly isn't good and worth debugging. Changes to the CIB ought to
take effect immediately without a restart.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] 2 node clustes with seperate quorum server

2013-09-25 Thread Lars Marowsky-Bree
On 2013-09-25T11:00:17, Chuck Smith cgasm...@comcast.net wrote:

 do act accordingly. For instance, I have raw primitives, add them to a
 group, then decide to move them to a different group (subject to load order).
 I can't just remove one from the group and put it in a different one; I have
 to remove it first, commit the change, then add it to the new group, and
 commit the change again.

Moving a resource from one group to another without unnecessary restarts
should be quite possible.

But you are right - it is not. The crmsh backend is not *quite* smart
enough if you're doing it in one edit session. (e.g., removing it
somewhere and adding it elsewhere in the same step.) And I think it
should be, hence Cc'ing Kristoffer "I'm rewriting the parser" Grönlund
;-)

For now, you have to split it in two steps:

# edit (and now remove the resources from groups)
# edit (and now add them again)
# commit

Note you don't need an intermediate commit, so no restarts should
happen. (Unless you add a stopped resource to the middle of an otherwise
started group, or vice versa.)

You can also do something like:

crm(live)configure# show
group grp1 dummy1 dummy2 dummy3
group grp2 dummy4 dummy5
crm(live)configure# modgroup grp1 remove dummy2
crm(live)configure# modgroup grp2 add dummy2
crm(live)configure# show
group grp1 dummy1 dummy3
group grp2 dummy4 dummy5 dummy2

See help modgroup for more details. That might be useful if you're
doing this regularly or need it scripted.

And just in case you weren't aware of this, you can ask what the cluster
would do before actually committing:

crm(live)configure# simulate actions nograph
  notice: LogActions:   Move    dummy4  (Started hex-2 -> hex-3)
  notice: LogActions:   Move    dummy5  (Started hex-2 -> hex-3)
  notice: LogActions:   Move    dummy2  (Started hex-1 -> hex-3)
  notice: LogActions:   Move    dummy1  (Started hex-3 -> hex-1)
  notice: LogActions:   Restart dummy3  (Started hex-1)

(I had my resources ungrouped before this example, so I did exactly what
I told you not to do to avoid spurious restarts ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-17 Thread Lars Marowsky-Bree
On 2013-09-16T16:36:38, Tom Parker tpar...@cbnco.com wrote:

  Can you kindly file a bug report here so it doesn't get lost
  https://github.com/ClusterLabs/resource-agents/issues ?
 Submitted (Issue #308)

Thanks.

 It definitely leads to data corruption and I think has to do with the
 way that the locking is not working properly on my lvm partitions. 

Well, not really an LVM issue. The RA thinks the guest is gone, the
cluster reacts and schedules it to be started (perhaps elsewhere); and
then the hypervisor starts it locally again *too*.

I think changing those libvirt settings to destroy could work - the
cluster will then restart the guest appropriately, not the hypervisor.
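I.e. something like this in the guest's libvirt definition (a sketch;
the exact elements depend on your setup):

    <on_poweroff>destroy</on_poweroff>
    <on_reboot>destroy</on_reboot>
    <on_crash>destroy</on_crash>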


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-17 Thread Lars Marowsky-Bree
On 2013-09-17T11:38:34, Ferenc Wagner wf...@niif.hu wrote:

 On the other hand, doesn't the recover action after a monitor failure
 consist of a stop action on the original host before the new start, just
 to make sure?  Or maybe I'm confusing things...

Yes, it would - but it seems there's a brief period during reboot where
the guest is shown as gone/cleanly stopped, and then the stop action
will just see the very same.

Actually that strikes me as a problem with Xen/libvirt's reporting.

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Clone colocation missing? (was: Pacemaker 1.19 cannot manage more than 127 resources)

2013-09-14 Thread Lars Marowsky-Bree
On 2013-09-13T17:48:40, Tom Parker tpar...@cbnco.com wrote:

 Hi Feri
 
 I agree that it should be necessary but for some reason it works well 
 the way it is and everything starts in the correct order.  Maybe 
 someone on the dev list can explain a little bit better why this is 
 working.  It may have something to do with the fact that it's a clone 
 instead of a primitive.

And luck. Your behaviour is undefined, and will work for most of the
common cases.

":" versus "inf:" on the order means that, during a healthy start-up,
A will be scheduled to start before B. It does not mean that B will need
to be stopped before A. Or that B shouldn't start if A can't. Typically,
both are required.

Since you've got ordering, *normally*, B will start on a node where A is
running. However, if A can't run on a node for any given reason, B will
still try to start there without collocation. Typically, you'd want to
avoid that.

The issues with the start sequence tend to be mostly harmless - you'll
just get additional errors for failure cases that might distract you
from the real cause.

The stop issue can be more difficult, because it might imply that A
fails to stop because B is still active, and you'll get stop escalation
(fencing). However, it might also mean that A enters an escalated stop
procedure itself (like Filesystem, which will kill -9 processes that are
still active), and thus implicitly stop B by force. That'll probably
work, you'll see everything stopping, but it might require additional
time from B on next start-up to recover from the aborted state.

e.g., you can be lucky, but you also might turn out not to be. In my
experience, this means it'll all work just fine during controlled
testing, and then fail spectacularly under a production load.

Hence the recommendation to fix the constraints ;-)
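E.g., with made-up resource names, the pair of constraints would look
something like this in crmsh syntax:

    order A-before-B inf: A-clone B-clone
    colocation B-with-A inf: B-clone A-clone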


(And, yes, this *does* suggest that we made a mistake in making this so
easy to misconfigure. But, hey, ordering and colocation are independent
concepts! The design and abstraction is pure! And I admit that I guess
I'm to blame for that to a large degree ... If the most common case is
"A, then B" + "B where A", why isn't there a recommended constraint that
just does both with no way of misconfiguring that? It's pretty high on
my list of things to have fixed.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Xen RA and rebooting

2013-09-14 Thread Lars Marowsky-Bree
On 2013-09-14T00:28:30, Tom Parker tpar...@cbnco.com wrote:

 Does anyone know of a good way to prevent pacemaker from declaring a vm
 dead if it's rebooted from inside the vm.  It seems to be detecting the
 vm as stopped for the brief moment between shutting down and starting
 up. 

Hrm. Good question. Because to the monitor, it really looks as if the VM
is temporarily gone, and it doesn't know ... Perhaps we need to keep
looking for it for a few seconds.

Can you kindly file a bug report here so it doesn't get lost
https://github.com/ClusterLabs/resource-agents/issues ?

 Often this causes the cluster to have two copies of the same VM if the
 locks are not set properly (which I have found to be unreliable): one
 that is managed and one that is abandoned.

*This* however is really, really worrisome and sounds like data
corruption. How is this happening?


The work-around right now is to put the VM resource into maintenance
mode for the reboot, or to reboot it via stop/start of the cluster
manager.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] FW: Can't seem to shutdown the DC

2013-09-12 Thread Lars Marowsky-Bree
On 2013-09-12T18:14:04, marcy.d.cor...@wellsfargo.com wrote:

 
 Hello list,
 
 Using SUSE SLES 11 SP2.
 
 I have 4 servers in a cluster running cLVM + OCFS2.
 
 If I try to shut down the one that is the DC using openais stop, strange
 things happen, resulting in a really messed up cluster.
 On one occasion, another server decided he was the DC and the other 2 still
 thought the original DC was online and still the DC.
 Often it results in fencing and lots of reboots.
 If I tried to put the DC into standby mode, I get this

Your configuration for Pacemaker looks OK. My expectation would be that
you are suffering some networking issue.

Can you kindly open a support request with Novell Technical Services?


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Problem with crm 1.2.6 rc 3 and older

2013-09-10 Thread Lars Marowsky-Bree
On 2013-09-09T17:22:14, Dejan Muhamedagic deja...@fastmail.fm wrote:

  a) When pacemaker and all other commandline tool can live
  nicely with multiple meta-attributes sections (it seems to
  be allowed by the xml definition) and address all nvpairs
  just by name beneath this tag, than crm should insert or
  change only one instance of that nvpair. Demoting via crm_resource
  should have worked with this, shouldn't it?
 If there are multiple attributes sets, then you need to specify
 the id if you want crm_resource to operate on any of these.

If there are multiple ones not further affected by any rules, pick
either the one which already has the attribute name or the first one?
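For example (resource and attribute-set ids made up):

    crm_resource --resource my-rsc --meta \
        --set-parameter target-role --parameter-value Stopped \
        --set-name my-rsc-meta_attributes-2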

  b) When the duplication of the meta-attriutes section by
  itself is problematic I would expect crm refusing or
  at least warning about that construct.
 Right, it should do something about it.

Is this something created by crmsh or crm_attribute?

  What do you think?
 Since multiple attribute sets are allowed, then we need to
 support them too. However, in a case like this one, I guess that
 we should've put together the two attribute sets.

I'm not perfectly sure. We need to understand how they were created; I'm
pretty sure that we should not *insert* them first time; not even when
the user does something silly like

primitive ... \
meta ... \
params ... \
meta ...

Then they ought to be merged, I think.

But if they already exist, merging them silently probably isn't so hot.

 Could you please open a bug report. It seems like we'd need to
 track the discussion and possible solutions.

Thanks!


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Max number of resources under Pacemaker ?

2013-09-05 Thread Lars Marowsky-Bree
On 2013-09-04T08:26:14, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 In my experience network traffic grows somewhat linear with the size
 of the CIB. At some point you probably have to change communication
 parameters to keep the cluster in a happy comminication state.

Yes, I wish corosync would auto-tune to a higher degree. Apparently
though, that's a slightly harder problem.

We welcome any feedback on required tunables. Those that we ship on SLE
HA worked for us (and even for rather largeish configurations), but they
may not be appropriate everywhere.

 Apart from the cluster internals, there may be problems if a node goes
 online and hundreds of resources are started in parallel, specifically
 if those resources weren't designed for it. I suspect IP addresses,
 MD-RAIDs, LVM stuff, drbd, filesystems, exportfs, etc.

No, most of these resource scripts *are* supposed to be
concurrency-safe. If you find something that breaks, please share the
feedback.

It's true that the way concurrent load limitation is implemented in
Pacemaker/LRM isn't perfect yet. batch-limit is rather coarse. The
per-node LRM child limit is probably the best bet right now. But it
doesn't differentiate between starting many light-weight resources in
parallel (such as IPaddr) versus heavy-weights (VMs with Oracle
databases).

(migration-threshold goes in the same direction.)

Historical context matters. Pacemaker comes from the HA world; we still
believe 3-7 node clusters are the largest anyone ought to reasonably
build, considering the failure/admin/security domain issues with single
points of failure and the increasing likelihood of double failures etc.

But there's several trends -

Even those 3-7 nodes become increasingly powerful multi-core kick-ass
boxes. 7 nodes might well host hundreds of resources nowadays (say,
above 70 VMs with all their supporting resources).

People build much larger clusters because there's no good way to divide
and conquer yet - e.g., if you build several 3 or 5 node clusters,
there's no support for managing those clusters-of-clusters.

And people use Pacemaker for HPC style deployments (e.g., private
clouds with tons of VMs) - because while our HPC support is suboptimal,
it is better than the HA support in most of the Cloud offerings.


 As a note: Just recently we had a failure in MD-RAID activation with no real
 reason to be found in syslog, and the cluster got quite confused.
 (I had reported this to my favourite supporter (SR 10851868591), but haven't
 heard anything since then...)

I'll try to dig that out of the support system and give it a look.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Quick 'death match cycle' question.

2013-09-03 Thread Lars Marowsky-Bree
On 2013-09-03T10:25:58, Digimer li...@alteeve.ca wrote:

 I've run only 2-node clusters and I've not seen this problem. That said,
 I've long-ago moved off of openais in favour of corosync. Given that
 membership is handled there, I would look at openais as the source of your
 trouble.

This is, sorry, entirely wrong. openais does not handle membership.

This sounds more like a DC election race window or some such in
pacemaker. If Ulrich files a bug report with proper logs, I'm sure we
can resolve this (perhaps with an update to SP3 and a more recent
pacemaker release ;-).


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Quick 'death match cycle' question.

2013-09-03 Thread Lars Marowsky-Bree
On 2013-09-03T13:04:52, Digimer li...@alteeve.ca wrote:

 My mistake then. I had assumed that corosync was just a stripped down
 openais, so I figured openais provided the same functions. My personal
 experience with openais is limited to my early days of learning HA
 clustering on EL5.

Yes and no. SLE HA 11 ships openais as a corosync add-on, because it
still uses the AIS CKPT service for OCFS2 (which can't be changed w/o
breaking wire-compatibility). But it's already corosync underneath.

It still calls the init script openais for compatibility reasons (so
that existing scripts that call that don't fail). So that can be
confusing, because one still starts/stops openais ...

In openSUSE Factory, we're *this* close to basing the stack on the latest
upstream of everything, and cleanly so, I hope ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Quick 'death match cycle' question.

2013-09-03 Thread Lars Marowsky-Bree
On 2013-09-03T21:14:02, Vladislav Bogdanov bub...@hoster-ok.com wrote:

  To solve problem 2, simply disable corosync/pacemaker from starting on
  boot. This way, the fenced node will be (hopefully) back up and running,
  so you can ssh into it and look at what happened. It won't try to rejoin
  the cluster though, so no risk of a fence loop.
 Enhancement to this would be enabling corosync/pacemaker back during the
 clean shutdown and disabling it after boot.

There's something in sbd which does this. See
https://github.com/l-mb/sbd/blob/master/man/sbd.8.pod and the -S option.
I'm contemplating how to do this in a generic fashion.
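
On SLE HA that boils down to a couple of lines in /etc/sysconfig/sbd;
roughly (untested sketch, the device path is just a placeholder):

    # /etc/sysconfig/sbd (sketch)
    SBD_DEVICE="/dev/disk/by-id/my-sbd-disk"
    # -W: use the hardware watchdog
    # -S 1: "start mode" - sbd refuses to start (and hence the cluster
    #       stack won't join) if the node was previously fenced and its
    #       message slot hasn't been cleared yet
    SBD_OPTS="-W -S 1"

After inspecting the fenced node you'd then clear its slot manually,
something like "sbd -d <device> message <node> clear".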


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Pacemaker 1.1.9 cannot manage more than 127 resources

2013-08-30 Thread Lars Marowsky-Bree
On 2013-08-29T15:49:30, Tom Parker tpar...@cbnco.com wrote:

 Hello.  Last night I updated my SLES 11 servers to HAE-SP3 which contains
 the following versions of software:

Could you kindly file a report via NTS? That's the way to get official
and timely support for SLE HA. (I don't mean to cut off a mailing list
discussion here, but we can't prioritize it higher unless it's in our
support chain; otherwise my boss asks silly questions like "are you
working for our paying customers?" ;-)


For what it is worth, it is likely that you could shrink your
configuration considerably if you used resource templates for the
ocf:heartbeat:Xen parts.

You could massively condense the primitive definitions and probably
replace all related order/colocation constraints with one of each. That
probably would allow you to even handle this with a smaller buffer
size.
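
Roughly (untested sketch, using the aruauth1-qa example from your
config; the other names and values are made up):

    # shared defaults for all Xen guests
    rsc_template xen-vm ocf:heartbeat:Xen \
        params shutdown_timeout=120 \
        op start timeout=120 interval=0 \
        op stop timeout=180 interval=0 \
        op monitor interval=30 timeout=60
    # each VM then only carries what actually differs
    primitive aruauth1-qa @xen-vm params xmfile=/etc/xen/vm/aruauth1-qa
    primitive aruauth2-qa @xen-vm params xmfile=/etc/xen/vm/aruauth2-qa
    # one constraint referencing the template covers every VM using it
    order storage-before-vms inf: storage-clone xen-vm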

 order aruauth1-qa-after-storage : storage-clone aruauth1-qa

And the order constraints also need to be inf:, not just :, by the
way.
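
I.e., roughly:

    order aruauth1-qa-after-storage inf: storage-clone aruauth1-qa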


Best,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-ha-dev] crmsh error : cib-bootstrap-options already exist

2013-08-29 Thread Lars Marowsky-Bree
On 2013-08-28T20:13:43, Dejan Muhamedagic de...@suse.de wrote:

 A new RC has been released today. It contains both fixes. It
 doesn't do atomic updates anymore, because cibadmin or something
 cannot stomach comments.

Couldn't find the upstream bug report :-( Can you give me the pacemaker
bugid, please? Thanks!


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA-Dev: Linux-HA-Dev@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha-dev
Home Page: http://linux-ha.org/


Re: [Linux-HA] establishing a new resource-agent package provider

2013-08-13 Thread Lars Marowsky-Bree
On 2013-08-13T20:53:13, Andrew Beekhof and...@beekhof.net wrote:

  I'd:
  - Rename the provider to core
  - Rework our own documentation and as we find it
  - Transparently support references to ocf:heartbeat forever:
   - Re-map s/heartbeat/core/ in the LRM (silently, or it'd get really
 bloody annoying)
 
 Why not just create a symlink?  It doesn't even really matter in which 
 direction.
 Then crmsh/pcs just needs to filter whatever they choose to.
 No mapping needed.

Right, that counts as mapping, I guess. I'm fine with that too.
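
On disk that would be nothing more than (sketch, assuming the proposed
provider name "core"):

    # OCF providers are just directories under resource.d/
    ln -s /usr/lib/ocf/resource.d/heartbeat /usr/lib/ocf/resource.d/core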

   - When a new resource is created with that provider, rewrite it to core
 with a LOG_INFO message given to the user
   - Hide ocf:heartbeat from the normal list (or show it as
 "deprecated, use core instead"); but when someone types
 ocf:heart<TAB>, auto-complete to it and of course auto-complete
 all parameters.

Same here, those are just UI issues.

  So in short: Rename, but remain backwards-compatible (since the price is
  low).
 Was anyone proposing anything different?

I wasn't sure and thus wanted to make sure we weren't looking at a hard
rename situation. So all in all, I'm fine with renaming it.

(I've not had a large number of inquiries as to why something in the
stack is called heartbeat, though; and when I did, an "It's history" was
always sufficient. So my feelings aren't very strong about it, either. I
guess the rename will trade them for "so are these different from the
heartbeat ones? Which one do I have to use?" questions.)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T11:05:32, Wengatz Herbert herbert.weng...@baaderbank.de wrote:

 Seeing the high drop rate... (just compare this to the other NIC) - have
 you tried a new cable? Maybe it's a cheap hardware problem...

The drop rate is normal. A slave NIC in a bonded active/passive
configuration will drop all packets.

I do wonder why there's so much traffic on a supposedly passive NIC,
though.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T12:19:40, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 BTW: The way resource restart is implemented (i.e.: stop, wait, then
 start) has a major problem: If the stop causes the node where the crm
 command is running to be fenced, the resource will remain stopped even
 after the node has restarted.

Yes. That's a limitation that's difficult to overcome. restart is a
multi-phase command, so if something happens to the node where it runs,
you have a problem.

But the need for a manual restart should be rare. If the resource is
running and healthy according to monitor, why should it be necessary?
;-)

(Another way to trigger a restart is to modify the instance parameters.
Set __manual_restart=1 and it'll restart.)
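
E.g. (resource name made up; any instance-parameter change triggers the
restart, the parameter name itself is arbitrary):

    crm resource param myresource set __manual_restart 1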


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: crm resource restart is broken (d4de3af6dd33)

2013-07-12 Thread Lars Marowsky-Bree
On 2013-07-12T12:26:18, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  (Another way to trigger a restart is to modify the instance parameters.
  Set __manual_restart=1 and it'll restart.)
 once? ;-)

Keep increasing it. ;-)


-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-11 Thread Lars Marowsky-Bree
On 2013-07-11T08:41:33, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  For a really silly idea, but can you swap the network cards for a test?
  Say, with Intel NICs, or even another Broadcom model?
 Unfortunately no: The 4-way NIC is onboard, and all slots are full.

Too bad.

But then you could really try raising a support request about the
network driver, perhaps one of the kernel/networking gurus has an idea.

 RX packet drops. Maybe the bug is in the bonding code...
 bond0: RX packets:211727910 errors:0 dropped:18996906 overruns:0 frame:0
 eth1: RX packets:192885954 errors:0 dropped:21 overruns:0 frame:0
 eth4: RX packets:18841956 errors:0 dropped:18841956 overruns:0 frame:0
 
 Both cards are identical. I wonder: If bonding mode is fault-tolerance
 (active-backup), is it normal then to see such statistics. ethtool -S reports
 a high number for rx_filtered_packets...

Possibly. It'd be interesting to know what packets get dropped; this
means you have approx. 10% of your traffic on the backup link. I wonder
if all the nodes/switches/etc agree on what is the backup port and what
isn't ...?
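
A quick check on each node (this only shows the host's view, not what
the switch thinks, of course):

    grep -i "currently active slave" /proc/net/bonding/bond0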

If 10% of the communication ends up on the wrong NIC, that surely would
mess up a number of recovery protocols.

An alternative test case would be to see how the system behaves if you
disable bonding - or, if the interface names should stay the same, with
only one NIC left in the bond.



Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Master/Slave status check using crm_mon

2013-07-10 Thread Lars Marowsky-Bree
On 2013-07-10T13:26:32, John M john332...@gmail.com wrote:

 Current application supports only the Master/Slave configuration and
 there can be one master and one slave process in a group.

A cluster can host multiple groups. You could, indeed, group your
systems into 3 or 5 node clusters, and not bother with quorum nodes
but get real quorum.

 Also each pair(Master/Slave) processes certain set of data.
 
 It would be great if you could confirm whether I can go ahead with
 STONITH feature.

Of course. You would need STONITH and fencing even for larger
environments.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-10 Thread Lars Marowsky-Bree
On 2013-07-10T08:31:17, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

 I had reported about terrible performance of cLVM (maybe related to using
 OCFS also) when used in SLES11 SP2. I guessed cLVM (or OCFS2) is
 communicating itself to death on activity. Now I have some interesting news:

No, the performance issue with cLVM2 mirroring is not at all related to
OCFS2; that's just cLVM2's algorithm being, well, suboptimal.

 on top of cLVM/OCFS I have image files for Xen VMs. I set up an OpenLDAP
 server in one of the VMs. Now about every time the LDAP server gets an update
 (meaning it does some flushed disk writes), corosync reports a faulty ring.
 It's like:

That, though, clearly shouldn't happen. And I've never seen this,
despite hosting a few VMs on my OCFS2 cluster (even with cLVM2
mirroring).

Network problems in hypervisors though also have a tendency to be, well,
due to the hypervisor, or some network cards (broadcom?).

  # grep FAULTY /var/log/messages |wc -l
 1546
 
 However the FAULT never lasts longer than one second.

That's weird. Multicast or unicast?


 OTOH our network guy says it's impossible to use the full network
 bandwidth. This makes me wonder: Is there a protocol implementation
 bug in TOTEM that is triggered when lots of packets arrive or when
 packets are delayed slightly, or is there a kernel bug that loses
 packets?

My guess would be the latter here.

Can this be reproduced with another high network load pattern? Packet
loss etc?

 Is there any perspective to see the light at the end of the tunnel? The 
 problems should be easily reproducible.

Bugs that get reported have a chance of being fixed ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: beating a dead horse: cLVM, OCFS2 and TOTEM

2013-07-10 Thread Lars Marowsky-Bree
On 2013-07-10T14:33:12, Ulrich Windl ulrich.wi...@rz.uni-regensburg.de wrote:

  Network problems in hypervisors though also have a tendency to be, well,
  due to the hypervisor, or some network cards (broadcom?).
 
 Yes:
 driver: bnx2
 version: 2.1.11
 firmware-version: bc 5.2.3 NCSI 2.0.12

For a really silly idea, but can you swap the network cards for a test?
Say, with Intel NICs, or even another Broadcom model?

  Can this be reproduced with another high network load pattern? Packet
  loss etc?
 No, but TCP handles packet loss more gracefully than the cluster, it seems.

A single lost packet shouldn't cause that, I think. (There may, of
course, also be more problems hidden in corosync.) Anything showing up
on the ifconfig stats or with a ping flood?

Some network card and hypervisor combos apparently don't play well with
multicast, either. You could also try switching to unicast
communication.
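
Switching is just a corosync.conf change; roughly (corosync 1.4-style
sketch, addresses are placeholders):

    totem {
        version: 2
        transport: udpu
        interface {
            ringnumber: 0
            bindnetaddr: 192.168.1.0
            member {
                memberaddr: 192.168.1.1
            }
            member {
                memberaddr: 192.168.1.2
            }
        }
    }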

And if this reproduces, you could try the SP3 update which ought to be
mirrored out now (which includes a corosync update and a kernel refresh;
corosync 1.4.6 is already in the maintenance queue).

Sometimes I think the worst part about distributed processes is that
they have to rely on networking. But then I remember they rely on human
programmers too, and the network isn't looking so bad any more ;-)


  Is there any perspective to see the light at the end of the tunnel? The 
  problems should be easily reproducable.
  Bugs that get reported have a chance of being fixed ;-)
 One more bug and my support engineer kills me ;-)

There's already a bounty on your head, it can't get any worse ;-)


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Best version of Heartbeat + Pacemaker

2013-07-09 Thread Lars Marowsky-Bree
On 2013-07-08T22:35:31, Digimer li...@alteeve.ca wrote:

 As for multi-DC support, watch the booth project. It's supposed to bring
 stretch clustering to corosync + pacemaker.

Stretch clustering is already possible and supported (depending on whom
you ask; it is on SLE HA) with corosync.

booth brings not stretched clustering but multi-site via
clusters-of-clusters; different architecture.
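
To give an idea, a minimal booth configuration for two sites plus an
arbitrator looks roughly like this (sketch, addresses made up):

    # /etc/booth/booth.conf
    transport="UDP"
    port="9929"
    arbitrator="192.168.203.100"
    site="192.168.201.100"
    site="192.168.202.100"
    ticket="ticketA"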


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Master/Slave status check using crm_mon

2013-07-09 Thread Lars Marowsky-Bree
On 2013-07-09T20:06:45, John M john332...@gmail.com wrote:

 Now I want to know
 1. Can I use a node which is part of another cluster as a quorum node?
 2. Can I configure a standalone quorum node that can manage 25 clusters?

No. Using the quorum node approach, a node can only ever be part of one
cluster. You could, of course, host VMs on the node.

But seriously, the whole "add a node just as a quorum member" approach
is 98% broken. Just. Don't.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Antw: Re: Master/Slave status check using crm_mon

2013-07-09 Thread Lars Marowsky-Bree
On 2013-07-09T23:11:01, John M john332...@gmail.com wrote:

 So STONITH/fencing is the only option?

A quorum node is no alternative to fencing, anyway.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-07-05 Thread Lars Marowsky-Bree
On 2013-07-05T19:06:54, Vladislav Bogdanov bub...@hoster-ok.com wrote:

  params #merge param1=value1 param2=value2
  
  meta #replace ...
  
  utilization #keep
  
  and so on. With default to #replace?
 
 Even more.
 If we allow such meta lexemes anywhere (not only at the very
 beginning), then they may be applied only to the rest of the string (or
 before another meta lexeme).
 
 The best thing I see about this idea is this is fully backwards compatible.

From a language aesthetics point of view, this gives me the utter
creeps. Don't make me switch to pcs! ;-)

I could live with a proper "merge/update" or "replace" command as a
prefix, just like we now have "delete", though. Similar to what we do
for groups.

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-07-03 Thread Lars Marowsky-Bree
On 2013-07-03T00:20:19, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 I do not edit them. In my setup I generate the full crm config with a
 template-based framework.

And then you do a load/replace? Tough; yes, that'll clearly overwrite
what is already there and added by scripts that more dynamically modify
the CIB.

Since we don't know your complete merging rules, it's probably easier if
your template engine gains hooks to first read the CIB for setting those
utilization values.
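
E.g. (sketch, using crmsh's node-level commands):

    # the RA detects the hardware and pushes it as utilization ...
    crm node utilization $(uname -n) set cpu \
        $(grep -c ^processor /proc/cpuinfo)
    # ... and the template hook reads it back before loading a
    # regenerated configuration
    crm node utilization $(uname -n) show cpu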

 That is a very convenient way to e.g. stop a dozen resources in one shot
 for some maintenance. I have a special RA which creates a ticket on
 cluster start and deletes it on cluster stop. And many resources may
 depend on that ticket. If you request the resource handled by that RA to
 stop, the ticket is revoked and all dependent resources stop.
 
 I wouldn't have written that RA if I had cluster-wide attributes (which
 perform like node attributes but for the whole cluster).

Right. But. Tickets *are* cluster-wide attributes that are meant to
control the target-role of many resources depending on them. So you're
getting exactly what you need, no? What is missing?
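
I.e., the existing mechanism already looks roughly like this (names
invented):

    # all resources listed here stop when ticketA is revoked
    rsc_ticket svcs-need-ticketA ticketA: svc1 svc2 svc3 loss-policy=stop
    # granting/revoking then replaces the special RA, e.g.:
    #   crm_ticket --ticket ticketA --grant
    #   crm_ticket --ticket ticketA --revoke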


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-07-03 Thread Lars Marowsky-Bree
On 2013-07-03T10:26:09, Dejan Muhamedagic deja...@fastmail.fm wrote:

  Not sure that is expected by most people.
  How do you then delete attributes?
 Tough call :) Ideas welcome.

Set them to an empty string, or a magic #undef value.

 It's not only for the nodes. Attributes of resources should be
 merged as well. Perhaps to introduce another load method, say
 merge, which would merge attributes of elements instead of
 replacing them. Though the use would then get more complex (which
 seems to be justified here).

Well, that leaves open the question of how higher-level objects
(primitives, clones, groups, constraints ...) would be affected/deleted.

I'm not sure the complexity is really worth it. Merge rules get *really*
complex, quickly. And eventually, one ends with the need to annotate the
input with how one wants a merge to be resolved (such as #undef
values).

Then, one might go the easier way of having the caller tell us what they
want changed explicitly, instead of having to figure it out ourselves.
The whole I'll magically replace the whole configuration and crmsh will
figure it out premise seems broken.


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] Backing out of HA

2013-07-02 Thread Lars Marowsky-Bree
On 2013-07-01T16:31:13, William Seligman selig...@nevis.columbia.edu wrote:

 a) people can exclaim "You fool!" and point out all the stupid things I did
 wrong;
 
 b) sysadmins who are contemplating the switch to HA have additional points to
 add to the pros and cons.

I think you bring up an important point that I also try to stress when I
talk to customers: HA is not for everyone, since it's not a magic
bullet. HA environments can protect against hardware faults (and some
operator issues and failing software), but they need to be carefully
managed and designed. They don't come for free, and the complexity can
be a deterrent.

(While "complex, but not complicated" is a good goal, it's not
easily achieved.)

And that's why the additional recovery plans should always include Plan C:
how do I bring my services online manually, without a cluster stack?

 I'll mention this one first because it's the most recent, and it was the straw
 that broke the camel's back as far as the users were concerned.
 
 Last week, cman crashed, and the cluster stopped working. There was no clear
 message in the logs indicating why. I had no time for archeology, since the
 crash happened in the middle of our working day; I rebooted everything and cman
 started up again just fine.

Such stuff happens; without archaeology, we're unlikely to be able to
fix it ;-) I take it though you're not running the latest supported
versions and don't have good support contracts; that's really
important.

And, of course, why we strive to produce support tools that allow first
failure data capture - so we can get a full overview of the log files
and what triggered whatever problem the system encountered, without
needing to reproduce.  (crm_/hb_report, qb_blackbox, etc.)

 Problems under heavy server load:
 
 Let's call the two nodes on the cluster A and B. Node A starts running a
 process that does heavy disk writes to the shared DRBD volume. The load on A
 starts to rise. The load on B rises too, more slowly, because the same blocks
 must be written to node B's disk.
 
 Eventually the load on A grows so great that cman+clvmd+pacemaker does not
 respond promptly, and node B stoniths node A. The problem is that the DRBD
 partition on node B is marked Inconsistent. All the other resources in the
 pacemaker configuration depend on DRBD, so none of them are allowed to run.

This shouldn't happen. The cluster stack is supposed to be isolated from
the mere workload via realtime scheduling/IO priority and locking
itself into memory. Or you had too short timeouts for the monitoring
services.

(I have noticed a recent push to drop the SCHED_RR priority from
processes, because that seemingly makes some problems go away. But
personally, I think that just masks the issue of priority inversion in
the message layers somewhere, and isn't a proper fix; exposing us to
more situations as described here instead.)

But even that shouldn't lead to stonith directly, but to resources being
stopped. Only if that fails would it then cause a fence.

And - a fence also shouldn't make DRBD inconsistent like this. Is your
DRBD set up properly?
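
The usual checklist item for DRBD under pacemaker is the resource-level
fencing hook; roughly (sketch, DRBD 8.x-style syntax, resource name made
up):

    resource r0 {
      disk {
        # resource-and-stonith is the usual choice with STONITH enabled
        fencing resource-and-stonith;
      }
      handlers {
        fence-peer "/usr/lib/drbd/crm-fence-peer.sh";
        after-resync-target "/usr/lib/drbd/crm-unfence-peer.sh";
      }
    }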

 Poisoned resource
 
 This is the one you can directly attribute to my stupidity.
 
 I add a new resource to the pacemaker configuration. Even though the pacemaker
 configuration is syntactically correct, and even though I think I've tested it,
 in fact the resource cannot run on either node.
 
 The most recent example: I created a new virtual domain and tested it. It
 worked fine. I created the ocf:heartbeat:VirtualDomain resource, verified that
 crm could parse it, and activated the configuration. However, I had not
 actually created the domain for the virtual machine; I had typed "virsh create
 ..." but not "virsh define ..."
 
 So I had a resource that could not run. What I'd want to happen is for the
 poisoned resource to fail, for me to see lots of error messages, but for the
 remaining resources to continue to run.
 
 What actually happens is that the resource tries to run on both nodes
 alternately an infinite number of times (1 times or whatever the value is).
 Then one of the nodes stoniths the other. The poisoned resource still won't
 run on the remaining node, so that node tries restarting all the other
 resources in the pacemaker configuration. That still won't work.

Yeah, so, you describe a real problem here.

Poisoned resources indeed should just fail to start and that should be
that. What instead can happen is that the resource agent notices it
can't start, reports back to the cluster, and the cluster manager goes
"Oh no, I couldn't start the resource successfully! It's now possibly in
a weird state and I better stop it!"

... And because of the misconfiguration, the *stop* also fails, and
you're hit with the full power of node-level recovery.
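
One way to take the edge off while debugging such a resource is not to
escalate a failed stop; e.g. (sketch, names and paths made up, and not a
general recommendation):

    primitive vm1 ocf:heartbeat:VirtualDomain \
        params config="/etc/libvirt/qemu/vm1.xml" \
        op start timeout=120 interval=0 \
        op stop timeout=120 interval=0 on-fail=block \
        op monitor interval=30 timeout=60
    # on-fail=block leaves the resource (and anything depending on it)
    # blocked instead of fencing the node when the stop fails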

I think this is an issue with some resource agents (if the parameters
are so bad that the resource couldn't possibly have started, why fail
the stop?) and possibly also something where one 

Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-07-02 Thread Lars Marowsky-Bree
On 2013-07-02T11:05:01, Vladislav Bogdanov bub...@hoster-ok.com wrote:

 One thing I see immediately, is that node utilization attributes are
 deleted after I do 'load update' with empty node utilization sections.
 That is probably not specific to this patch.

Yes, that isn't specific to that.

 I have that attributes dynamic, set from a RA (as node configuration may
 vary, I prefer to detect how much CPU and RAM I have and set utilization
 accordingly rather then put every hardware change into CIB).

 Or may be it is possible to use transient utilization attributes?
 I don't think so... Ugh, that would be nice.

Yes, that's exactly what you need here.

Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


Re: [Linux-HA] disallowing concurrent configuration (CIB modifications)

2013-07-02 Thread Lars Marowsky-Bree
On 2013-07-02T13:14:48, Vladislav Bogdanov bub...@hoster-ok.com wrote:

  Yes, that's exactly what you need here.
 I know, but I do not expect that to be implemented soon.

crm_attribute -l reboot -z doesn't strike me as an unlikely request.
You could file an enhancement request for that.

But with the XML diff feature, as long as you're not editing the node
section, that shouldn't be a problem - unrelated changes shouldn't
overwrite those attributes, right? That being the whole point?

(Of course, if you remove them in the copy, that'd remove them.)

 But tickets currently are quite limited - they have only 4 states, so
 it is impossible to put e.g. a number there.

What are you trying to do with that?


Regards,
Lars

-- 
Architect Storage/HA
SUSE LINUX Products GmbH, GF: Jeff Hawn, Jennifer Guild, Felix Imendörffer, HRB 
21284 (AG Nürnberg)
Experience is the name everyone gives to their mistakes. -- Oscar Wilde

___
Linux-HA mailing list
Linux-HA@lists.linux-ha.org
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems


  1   2   3   4   5   6   7   8   9   10   >