Hi,

On Wed, Aug 18, 2010 at 10:46:41AM +0900, Simon Horman wrote:
> On Tue, Aug 17, 2010 at 07:21:40PM -0600, Tim Serong wrote:
> > On 8/18/2010 at 10:25 AM, Simon Horman <[email protected]> wrote: 
> > > On Tue, Aug 17, 2010 at 06:12:04PM -0600, Tim Serong wrote: 
> > > > On 8/18/2010 at 09:03 AM, Simon Horman <[email protected]> wrote:  
> > > > > On Tue, Aug 17, 2010 at 03:06:45PM +0200, Dejan Muhamedagic wrote:  
> > > > > > Hi,  
> > > > > >   
> > > > > > On Tue, Aug 17, 2010 at 04:50:27PM +0900, Simon Horman wrote:  
> > > > > > > On Wed, Jul 21, 2010 at 01:41:09AM -0600, Tim Serong wrote:  
> > > > > > > > Hi All,  
> > > > > > > >   
> > > > > > > > A while ago (April, from memory), there was an ABI change in
> > > > > > > > clplumbing in cluster-glue.  Presumably this went mostly unnoticed
> > > > > > > > in general usage, however I have twice seen systems where the cluster
> > > > > > > > could not run because of a missing (or incorrect) libglue2 package.
> > > > > > > > One was my development system, with a dodgy build, the other was
> > > > > > > > mentioned on #linux-ha yesterday, and was the result of ignoring a
> > > > > > > > conflict error when installing the pacemaker RPM on openSUSE.  So,
> > > > > > > > let me be clear, this is not something anyone should need to worry
> > > > > > > > about...  But I thought I'd mention it here, because the error
> > > > > > > > messages you get are, IMO, not very obvious.
> > > > > > > >   
> > > > > > > > Symptoms of a mismatched pacemaker/libglue build are errors like:
> > > > > > > >
> > > > > > > >   lrmd: [3004]: ERROR:
> > > > > > > >     main: can not create wait connection for command.
> > > > > > > >   lrmd: [3004]: ERROR:
> > > > > > > >     Startup aborted (can't create comm channel).  Shutting down.
> > > > > > > >   ...
> > > > > > > >   pengine: [4011]: ERROR:
> > > > > > > >     init_client_ipc_comms_nodispatch: Could not access channel on:
> > > > > > > >     /var/run/crm/pengine
> > > > > > > >   corosync[4000]: [pcmk  ] ERROR:
> > > > > > > >     pcmk_wait_dispatch: Child process pengine exited (pid=4011, rc=1)
> > > > > > > >   corosync[4000]: [pcmk  ] notice:
> > > > > > > >     pcmk_wait_dispatch: Respawning failed child process: pengine
> > > > > > > >   
> > > > > > > > If your cluster won't start and you see this in /var/log/messages,
> > > > > > > > make sure libglue2 is up to date.  And now that I've mentioned this
> > > > > > > > here and it's made it to the mailing list archive, Google will know,
> > > > > > > > and nobody else will ever have this problem again.
> > > > > > > >
> > > > > > > > This has been a public service announcement.  Thank you for reading.
> > > > > > >   
> > > > > > > Could we get the .so bumped accordingly in the next release of  
> > > > > > > cluster glue? That would at least help in managing the problem  
> > > > > > > once the new release has been made.  
> > > > > >   
> > > > > > I don't think that that is necessary. The ABI change in the  
> > > > > > _released_ cluster-glue packages was done in such a way as not to  
> > > > > > disturb the existing pacemaker installations, i.e. by adding  
> > > > > > fields to the end of the struct. Further, the library version has  
> > > > > > been bumped to 3:0:1 (with libtool's -version-info) at the time.  
> > > > > > For whatever reason that translates to so.2.1.0. Users of the new  
> > > > > > ABI are also using domain sockets of the new type if they want  
> > > > > > the new functionality.  
> > > > > >   
> > > > > > I guess that what Tim was seeing was Pacemaker built against the  
> > > > > > unreleased glue versions which did have different ABI, i.e. the  
> > > > > > fields were inserted somewhere in the middle of the struct.  
> > > > >   
> > > > > Ok, so no ABI incompatibility was introduced in 1.0.6. Great!  
> > > > > I will go ahead and close the related Debian bugs,  
> > > > > #593319, #593321, #593322 and #593323.  
> > > >  
> > > > I was seeing Pacemaker *built* against new glue, installed on a system 
> > > > that had *old* glue installed, because both libglue2 (new glue) and 
> > > > libheartbeat2 < 3.0 (old glue) provide what looks like the same DSO; 
> > > > so when Pacemaker was upgraded on this system, libheartbeat2 was not 
> > > > automatically upgraded to libglue2.  For reference, there's an 
> > > > openSUSE 11.3 bug for this: 
> > > >  
> > > >   https://bugzilla.novell.com/show_bug.cgi?id=628243 
> > > >  
> > > > I believe this may only be a problem on openSUSE 11.3, where heartbeat 
> > > > 2.99.3 still exists, providing old libheartbeat2. 
> > > >  
> > > > It shouldn't be a problem the other way around (i.e. old Pacemaker is 
> > > > meant to work with new glue, as Dejan said). 
> > >  
> > > Understood. 
> > >  
> > > Was the new glue that you used for building a released version 
> > > or an hg snapshot? 
> > 
> > The first time I saw it was with an odd build around the time
> > of glue 1.0.4 or 1.0.5 (with which there was definitely a problem,
> > see http://www.gossamer-threads.com/lists/linuxha/dev/63396). 
> > 
> > The issue on openSUSE 11.3 is with Pacemaker built against glue slightly
> > newer than 1.0.5 (changeset 1448deafdf79), but installed with libheartbeat2
> > 2.99.x instead of libglue2.
> > 
> > I have not tried Pacemaker built against glue 1.0.5, but installed with
> > an earlier glue (e.g. 1.0.4 or earlier).  I expect this would break in the
> > same way I mentioned originally.
> > 
> > I had a quick look at the Debian bugs you mentioned.  If it's possible at
> > all on Debian to have glue < 1.0.5 installed with Pacemaker built against
> > glue >= 1.0.5, I expect there will be trouble.  However, a quick search
> > on packages.debian.org shows no glue earlier than 1.0.5, so hopefully
> > this means you're good.
> 
> Hi Tim,
> 
> If we disregard unstable, which I think is reasonable, and look at testing,
> then the only versions of cluster-glue that have ever existed in Debian are
> 1.0.5-2 and 1.0.6-1 [1]. So it sounds like we should be ok.

Looking again at the whole matter, it is possible to run into
problems if one installs a pacemaker built against a new glue
release (>=1.0.5), but tries to run it with some older glue
release (<1.0.5). The reason is here (from
include/clplumbing/ipc.h):

/* Unix domain socket with farside uid + gid credentials.
 * Available since libplumb.so.2.1.0 */
#define IPC_UDS_CRED        "uds_c"

#ifdef IPC_UDS_CRED
#   define  IPC_ANYTYPE     IPC_UDS_CRED
#else
#   error "No IPC types defined(!)"
#endif

uds_c didn't exist in the older releases. There, IPC_ANYTYPE was
defined to be IPC_DOMAIN_SOCKET ("uds"). I must say that I don't
know why that was changed. Users who needed "uds_c" should have
asked for it explicitly.
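
To make the failure mode concrete, here is a minimal sketch. It is
not the clplumbing code: glue_knows_type() and client_connect() are
hypothetical stand-ins, and the has_uds_c flag simply models "old
glue" (0) versus "new glue" (1). The only things taken from ipc.h
are the two type strings. The point is that the type string is fixed
when the client is compiled, so a client built against the new
headers asks an old library for a type it has never heard of.

```c
#include <assert.h>
#include <string.h>

/* Channel type strings as they appear in include/clplumbing/ipc.h. */
#define IPC_DOMAIN_SOCKET "uds"
#define IPC_UDS_CRED      "uds_c"

/* Hypothetical stand-in for the library's channel-type lookup.
 * has_uds_c == 0 models glue < 1.0.5, has_uds_c == 1 models >= 1.0.5.
 * Returns 1 if the library can build a channel of that type. */
int glue_knows_type(int has_uds_c, const char *type)
{
    assert(type != NULL);
    if (strcmp(type, IPC_DOMAIN_SOCKET) == 0)
        return 1;                       /* "uds" exists in both */
    if (has_uds_c && strcmp(type, IPC_UDS_CRED) == 0)
        return 1;                       /* "uds_c" only in new glue */
    return 0;
}

/* Hypothetical client. compiled_anytype is whatever IPC_ANYTYPE
 * expanded to in the headers the client was built against.
 * Returns 0 on success, -1 if no channel could be created. */
int client_connect(int has_uds_c, const char *compiled_anytype)
{
    return glue_knows_type(has_uds_c, compiled_anytype) ? 0 : -1;
}
```

In this model client_connect(0, IPC_UDS_CRED) returns -1, which is
the "new pacemaker on old glue" case, while the other three
combinations of library and type string return 0.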

Cheers,

Dejan


> For the record, there has never been a release of Debian stable that
> included cluster-glue - it will appear in Squeeze for the first time.
> 
> [1] http://packages.qa.debian.org/c/cluster-glue.html
> 
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
