Hi LMB,

> > From my experience with SLES11 SP2 (with all current updates) I conclude 
> > that actually nobody is seriously running SP2 without local bugfixes.

> That isn't quite true.

No, this is true.
I provided an example which can easily be reproduced with stock SLES 11 SP2 and 
stock documentation.
On the other hand, you have so far not provided any case where SLES 11 SP2 runs 
reliably and unmodified in a mission-critical environment (e.g. an HA NFS 
server) without local bugfixes.

> > E.g. Even the most simple examples from the official SuSE documentation 
> > don't work as expected.

> Which ones?

Are you actually reading the messages on this list before replying? I provided 
an example just one line below.

> > A trivial example is ocf:heartbeat:exportfs as distributed by SuSE with SP2 
> > causes unlimited growth of .rmtab files (goes fast in the gigabytes for 
> > serious NFS servers). I could work around this issue using some shell 
> > scripting.

This is exactly a simple example of a resource not working on a fully updated 
SLES 11 SP2 HA cluster.
SuSE provides an official guide on how to set up a highly available NFS 
cluster. When following this guide, this rather simple use case has not worked 
for many months.
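The shell workaround mentioned above is not shown in this thread; a minimal sketch of what it could look like follows. Everything here is an assumption for illustration: the rmtab path, the file format, and the idea of simply deduplicating entries (the exportfs resource agent keeps a backup rmtab that grows because entries are appended without deduplication).

```shell
#!/bin/sh
# Hypothetical workaround sketch: deduplicate a growing rmtab file in place.
# The real path would be the backup rmtab kept by the exportfs agent
# (e.g. /var/lib/nfs/rmtab or a per-export copy) -- illustrative here.
dedup_rmtab() {
    rmtab=$1
    tmp=$(mktemp) || return 1
    # Keep only unique lines, then write the result back over the original.
    sort -u "$rmtab" > "$tmp" && cat "$tmp" > "$rmtab"
    rm -f "$tmp"
}

# Demonstration on a throwaway file with duplicated client entries.
demo=$(mktemp)
printf '10.0.0.1:/srv/nfs:0x1\n10.0.0.1:/srv/nfs:0x1\n10.0.0.2:/srv/nfs:0x1\n' > "$demo"
dedup_rmtab "$demo"
wc -l < "$demo"    # two unique entries remain
```

Run from cron every few minutes, something like this keeps the file bounded; it is a band-aid, not a fix for the resource agent itself.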

> This is an annoying bug, yes. It's an upstream bug that's been fixed now. 
> I'll check if the maintenance update is already released.

This bug has not been fixed in SLES 11 SP2 for many months. The fact that you 
are aware of it but don't ship a maintenance release for obvious bugs that are 
triggered in the default use cases speaks for itself.

http://www.suse.com/support/kb/doc.php?id=7008514

This is not merely an annoying bug: it renders the cluster unusable after some 
days of usage. That is unacceptable and dangerous in production environments. 
(HA clusters tend to be used in the more critical environments.)

The fact that you don't ensure that fixes are available in a timely manner, 
even though you claim the issue was fixed upstream, shows that SuSE is not 
committed to supporting mission-critical setups.

Funnily enough, today running single servers is more reliable and provides 
better uptimes and service availability than using the SLES HA Extension.

> > There are other issues which are more than annoying and actually make
> > the SLES SP2 HA Extension unusable for production systems. E.g. clvmd
> > cannot be made less verbose from the cluster configuration. (No
> > daemon_options="-d0" does not help!)

> It shouldn't log that much even at the regular loglevel. It's reasonably 
> quiet (except on fail-/switch-overs, of course). What do you find excessive?

Are you actually running a single instance of a SLES 11 SP2 cluster in 
production yourself?

Did you ever check the logfiles on any up-to-date SLES 11 SP2 HA cluster 
yourself?

rt-lxcl9b:/var/log # cat /var/log/messages | wc -l
74838
rt-lxcl9b:/var/log # cat /var/log/messages | grep clvmd | wc -l
74028
rt-lxcl9b:/var/log # ps uax | grep clvmd
root      3227  0.0  0.0 149100 46404 ?        SLsl Aug10   0:24 
/usr/sbin/clvmd -d0

That is about 74,000 (*) messages from clvmd in about 40 hours.

(No failovers, switchovers or anything else that would be expected to cause 
logging.)
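Until a fixed clvmd ships, the noise can at least be contained at the syslog level. SLES 11 ships syslog-ng by default, so a fragment along these lines could divert clvmd chatter into its own file instead of flooding /var/log/messages. This is a sketch, not a fix: the source name `src`, the filter/destination names and the log file path are assumptions to be adapted to the local syslog-ng.conf.

```
# Hypothetical syslog-ng fragment (names are illustrative):
# route clvmd messages to a dedicated file instead of /var/log/messages.
filter f_clvmd      { program("clvmd"); };
destination d_clvmd { file("/var/log/clvmd.log"); };
log { source(src); filter(f_clvmd); destination(d_clvmd); flags(final); };
```

Note that `flags(final)` stops further matching for these messages, so this log statement has to appear before the one that writes /var/log/messages; the clvmd log file then needs its own logrotate entry.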

> > Not funny is also the fact that the official SLES 11 SP2 kernels crash
> > seriously (when a node rejoins the cluster) when using STCP as
> > recommended in the SLES HA documentation and offered via the wizards.
> > It took me a while to find out what was going on.

> We've not observed this. Have you reported a bug?

Argl... Yes, I reported a bug. Yes, I reported how to reproduce it. Yes, I 
provided a full description and offered a kernel dump...

> > When setting up a system with many (rather simple) resources funny things 
> > happen due to race conditions all over the place. (can be worked around 
> > mostly using arbitrary start-delay options.

> I've not encountered this either. Sorry for asking this, but: did you report 
> a bug?

Are you trying to make me angry?

> > Oh, did I mention that situations which are actually forbidden by 
> > constraints (e.g. using a score of INFINITY) actually do happen... 
> > Depending on the environment this can lead to not so funny effects.

> That would be a serious bug in the policy engine (and not just limited to SLE 
> HA 11 SP2).

Which does not really improve the situation for your customers.

> > E.g. I defined the following constraints:
> >
> > colocation c17 inf: p_lsb_ccslogserver p_fs_daten
> > order o34 inf: p_fs_daten p_lsb_ccslogserver:start
> >
> > I can prove from the logs that ccslogserver (an application) got
> > migrated from node A to node B while p_fs_daten (a filesystem on top
> > of drbd) was definitely still running on node A

> I'd be very, very interested in seeing these logs. The rules you specified 
> above should not allow for that, and I can't immediately imagine other rules 
> that still might allow for it.

> Strangely enough, enterprise distributions target paying customers. This is 
> not, I believe, a SUSE-specific constraint.

It is a SuSE-specific constraint. Even with Microsoft I can report a bug 
without first buying an additional support contract beyond the existing 
license and the existing maintenance contract.

(In my case I do consulting work for a customer who wishes to evaluate whether 
migrating to SLES 11 SP2 is an option for mission-critical workloads.
I waited until SP2 was released before even starting the evaluation, just to 
find out that SP2 fails in the simple test cases with configurations copied 
verbatim from the SLES HA documentation.
This customer buys SLES/RH/Windows licenses and support in bulk from a large 
multi-national. It is not feasible to buy an extra support contract directly 
from SuSE on top of that, just to be able to _report_ a bug or to provide a 
patch.)

> You can always file a bug against the upstream projects in the respective 
> communities; these will then possibly tell you to upgrade to latest upstream 
> first and reproduce.

Yes, I could do that.

Actually, I could provide a fully working and tested customized solution on 
top of OpenSuSE or Fedora, based on up-to-date upstream packages and local 
fixes, but I was asked to determine whether SLES 11 SP2 provides a suitable HA 
solution for mission-critical use cases.

Currently the result of this evaluation is that SLES 11 SP2 fails 
out-of-the-box even for the simple example cases.

> Eventually, these bugs will trickle back into the enterprise distributions as 
> well. That may just take a while.

Yes, and I am observing that SuSE is currently not able to provide upstream 
fixes in a timely manner, even for simple bugs.

> But yes, SLE HA (and RHEL clustering too) sort-of target customers who have 
> support contracts, either will SUSE/RHT or a strong consulting partner (who 
> preferably is a high-grade technology partner with the distributor).

In this case my customer has a "high-grade technology partner" which of course 
has proper contracts with SuSE, but my job is to evaluate the product itself.

> I admit I don't find this particular complaint convincing.

I admit that I am unable to convince you that a fully up-to-date SLES 11 SP2 
does not work reliably even for the most simple use case, with a setup copied 
verbatim from the SLES 11 SP2 HA documentation.

Knowing that you have now been in the SuSE HA business for more than 12 years, 
this makes me doubt whether SLES is still an option for mission-critical 
systems(**).

BTW: I was assuming that it was part of your job description to make sure that 
critical upstream/community fixes get integrated into the SLES 11 HA Extension 
in a timely manner. I guess I was wrong.

Yours,
-- martin
(*) The log fills up with the same rather useless debug output every 30 seconds:
Aug 17 16:52:16 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 
33859776 for 17082560. len 18
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 
17082560 for 0. len 32
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: add_to_lvmqueue: cmd=0x7fc7f80008b0. 
client=0x6934a0, msg=0x7fc7fef76ffc, len=32, csid=0x7ffff6a020e4, xid=0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_work_item: remote
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_remote_command unknown (0x2d) 
for clientid 0x5000000 XID 12916 on node 104a8c0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: Syncing device names
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: LVM thread waiting for work
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 
33859776 for 17082560. len 18
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 
17082560 for 0. len 32
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: add_to_lvmqueue: cmd=0x7fc7f80008b0. 
client=0x6934a0, msg=0x7fc7fef771ac, len=32, csid=0x7ffff6a020e4, xid=0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_work_item: remote
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: process_remote_command unknown (0x2d) 
for clientid 0x5000000 XID 12919 on node 104a8c0
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: Syncing device names
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: LVM thread waiting for work
Aug 17 16:52:46 rt-lxcl9b clvmd[3227]: 33859776 got message from nodeid 
33859776 for 17082560. len 18
(**) My main point is not about bugs, which are normal with complex systems. 
My concern is about how you/SuSE handle these bugs.
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems