Hi,
On Wed, Aug 18, 2010 at 10:46:41AM +0900, Simon Horman wrote:
> On Tue, Aug 17, 2010 at 07:21:40PM -0600, Tim Serong wrote:
> > On 8/18/2010 at 10:25 AM, Simon Horman <[email protected]> wrote:
> > > On Tue, Aug 17, 2010 at 06:12:04PM -0600, Tim Serong wrote:
> > > > On 8/18/2010 at 09:03 AM, Simon Horman <[email protected]> wrote:
> > > > > On Tue, Aug 17, 2010 at 03:06:45PM +0200, Dejan Muhamedagic wrote:
> > > > > > Hi,
> > > > > >
> > > > > > On Tue, Aug 17, 2010 at 04:50:27PM +0900, Simon Horman wrote:
> > > > > > > On Wed, Jul 21, 2010 at 01:41:09AM -0600, Tim Serong wrote:
> > > > > > > > Hi All,
> > > > > > > >
> > > > > > > > A while ago (April, from memory), there was an ABI change in
> > > > > > > > clplumbing in cluster-glue. Presumably this went mostly unnoticed
> > > > > > > > in general usage, however I have twice seen systems where the
> > > > > > > > cluster could not run because of a missing (or incorrect) libglue2
> > > > > > > > package. One was my development system, with a dodgy build, the
> > > > > > > > other was mentioned on #linux-ha yesterday, and was the result of
> > > > > > > > ignoring a conflict error when installing the pacemaker RPM on
> > > > > > > > openSUSE. So, let me be clear, this is not something anyone should
> > > > > > > > need to worry about... But I thought I'd mention it here, because
> > > > > > > > the error messages you get are, IMO, not very obvious.
> > > > > > > >
> > > > > > > > Symptoms of a mismatched pacemaker/libglue build are errors like:
> > > > > > > >
> > > > > > > > lrmd: [3004]: ERROR: main: can not create wait connection for command.
> > > > > > > > lrmd: [3004]: ERROR: Startup aborted (can't create comm channel). Shutting down.
> > > > > > > > ...
> > > > > > > > pengine: [4011]: ERROR: init_client_ipc_comms_nodispatch: Could not access channel on: /var/run/crm/pengine
> > > > > > > > corosync[4000]: [pcmk ] ERROR: pcmk_wait_dispatch: Child process pengine exited (pid=4011, rc=1)
> > > > > > > > corosync[4000]: [pcmk ] notice: pcmk_wait_dispatch: Respawning failed child process: pengine
> > > > > > > >
> > > > > > > > If your cluster won't start and you see this in /var/log/messages,
> > > > > > > > make sure libglue2 is up to date. And now that I've mentioned this
> > > > > > > > here and it's made it to the mailing list archive, Google will
> > > > > > > > know, and nobody else will ever have this problem again.
> > > > > > > >
> > > > > > > > This has been a public service announcement. Thank you for reading.
> > > > > > >
> > > > > > > Could we get the .so bumped accordingly in the next release of
> > > > > > > cluster glue? That would at least help in managing the problem
> > > > > > > once the new release has been made.
> > > > > >
> > > > > > I don't think that that is necessary. The ABI change in the
> > > > > > _released_ cluster-glue packages was done in such a way as not to
> > > > > > disturb the existing pacemaker installations, i.e. by adding
> > > > > > fields to the end of the struct. Further, the library version has
> > > > > > been bumped to 3:0:1 (with libtool's -version-info) at the time.
> > > > > > For whatever reason that translates to so.2.1.0. Users of the new
> > > > > > ABI are also using domain sockets of the new type if they want
> > > > > > the new functionality.
> > > > > >
> > > > > > I guess that what Tim was seeing was Pacemaker built against the
> > > > > > unreleased glue versions which did have different ABI, i.e. the
> > > > > > fields were inserted somewhere in the middle of the struct.
> > > > >
> > > > > Ok, so no ABI incompatibility was introduced in 1.0.6. Great!
> > > > > I will go ahead and close the related Debian bugs,
> > > > > #593319, #593321, #593322 and #593323.
> > > >
> > > > I was seeing Pacemaker *built* against new glue, installed on a system
> > > > that had *old* glue installed, because both libglue2 (new glue) and
> > > > libheartbeat2 < 3.0 (old glue) provide what looks like the same DSO;
> > > > so when Pacemaker was upgraded on this system, libheartbeat2 was not
> > > > automatically upgraded to libglue2. For reference, there's an
> > > > openSUSE 11.3 bug for this:
> > > >
> > > > https://bugzilla.novell.com/show_bug.cgi?id=628243
> > > >
> > > > I believe this may only be a problem on openSUSE 11.3, where heartbeat
> > > > 2.99.3 still exists, providing old libheartbeat2.
> > > >
> > > > It shouldn't be a problem the other way around (i.e. old Pacemaker is
> > > > meant to work with new glue, as Dejan said).
> > >
> > > Understood.
> > >
> > > Was the new glue that you used for building a released version
> > > or an hg snapshot?
> >
> > The first time I saw it was with an odd build around the time
> > of glue 1.0.4 or 1.0.5 (with which there was definitely a problem,
> > see http://www.gossamer-threads.com/lists/linuxha/dev/63396).
> >
> > The issue on openSUSE 11.3 is with Pacemaker built against glue slightly
> > newer than 1.0.5 (changeset 1448deafdf79), but installed with libheartbeat2
> > 2.99.x instead of libglue2.
> >
> > I have not tried Pacemaker built against glue 1.0.5, but installed with
> > an earlier glue (e.g. 1.0.4 or earlier). I expect this would break in the
> > same way I mentioned originally.
> >
> > I had a quick look at the Debian bugs you mentioned. If it's possible at
> > all on Debian to have glue < 1.0.5 installed with Pacemaker built against
> > glue >= 1.0.5, I expect there will be trouble. However, a quick search
> > on packages.debian.org shows no glue earlier than 1.0.5, so hopefully
> > this means you're good.
>
> Hi Tim,
>
> If we disregard unstable, which I think is reasonable, and look at testing,
> then the only versions of cluster-glue that have ever existed in Debian are
> 1.0.5-2 and 1.0.6-1 [1]. So it sounds like we should be ok.
Looking again at the whole matter, it is possible to run into
problems if one installs a pacemaker built against a new glue
release (>=1.0.5) but tries to run it with an older glue
release (<1.0.5). The reason is here (from
include/clplumbing/ipc.h):

    /* Unix domain socket with farside uid + gid credentials.
     * Available since libplumb.so.2.1.0 */
    #define IPC_UDS_CRED "uds_c"

    #ifdef IPC_UDS_CRED
    # define IPC_ANYTYPE IPC_UDS_CRED
    #else
    # error "No IPC types defined(!)"
    #endif

uds_c didn't exist before. Previously, IPC_ANYTYPE was defined to be
IPC_DOMAIN_SOCKET ("uds"). A binary built against the new header
therefore asks for "uds_c" whenever it uses IPC_ANYTYPE, and an older
library does not know that type. I must say that I don't know why
that changed. Users who needed "uds_c" should have asked for it
explicitly.
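
As a minimal sketch of the failure mode (not the actual clplumbing
code): IPC_ANYTYPE is expanded at compile time, so the string "uds_c"
ends up in any binary built against the new header. The lookup below is
a hypothetical stand-in for a library that selects a channel type by
name. lib_knows_type, old_lib_accepts, new_lib_accepts and the
ANYTYPE_* macros are made up for illustration; only the two type
strings come from ipc.h.

```c
#include <stddef.h>
#include <string.h>

/* Type names as quoted from include/clplumbing/ipc.h. */
#define IPC_DOMAIN_SOCKET "uds"
#define IPC_UDS_CRED      "uds_c"

/* Hypothetical: what IPC_ANYTYPE expands to in a caller, depending on
 * which header generation the caller was compiled against. */
#define ANYTYPE_OLD_HEADERS IPC_DOMAIN_SOCKET
#define ANYTYPE_NEW_HEADERS IPC_UDS_CRED

/* Hypothetical type tables: the old library predates "uds_c". */
static const char *const old_lib_types[] = { IPC_DOMAIN_SOCKET };
static const char *const new_lib_types[] = { IPC_DOMAIN_SOCKET, IPC_UDS_CRED };

/* Assumed behaviour: a type is accepted only if its name matches a
 * table entry exactly. Returns 1 if known, 0 otherwise. */
static int lib_knows_type(const char *const *known, size_t n, const char *type)
{
    for (size_t i = 0; i < n; i++) {
        if (strcmp(known[i], type) == 0)
            return 1;
    }
    return 0;
}

/* Would a library from before the change accept this type name? */
int old_lib_accepts(const char *type)
{
    return lib_knows_type(old_lib_types,
                          sizeof(old_lib_types) / sizeof(old_lib_types[0]),
                          type);
}

/* Would a library from after the change accept this type name? */
int new_lib_accepts(const char *type)
{
    return lib_knows_type(new_lib_types,
                          sizeof(new_lib_types) / sizeof(new_lib_types[0]),
                          type);
}
```

With these definitions, old_lib_accepts(ANYTYPE_NEW_HEADERS) returns 0
while the other three combinations return 1, which is consistent with
old pacemaker working on new glue but not the other way around.
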
Cheers,
Dejan
> For the record, there has never been a release of Debian stable that
> included cluster-glue - it will appear in Squeeze for the first time.
>
> [1] http://packages.qa.debian.org/c/cluster-glue.html
>
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems