Re: [Linux-HA] System rebooted during rolling upgrade

Serge Dubrouski Thu, 01 May 2008 08:26:41 -0700

Ok. Your packages are right. The problem here is that it's not clear
who currently maintains the heartbeat-* packages. I know for sure that
Dejan committed a fixed version of pgsql to Mercurial, but it looks like
the heartbeat-* packages weren't rebuilt and still include the 2.1.3
version. Unfortunately the current support/release situation for the
project isn't clear, but I know that the guys at Novell are trying to
improve it.


Attached is a fixed version of pgsql OCF.
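
For illustration only (this is not the attached script): a minimal sketch
of the instance-specific check Doug describes in the quoted message, i.e.
look for postmaster.pid under the data directory and then test the process
ID stored there. The function name pgsql_instance_running and its single
data-directory argument are hypothetical, chosen just for this example.

```shell
#!/bin/sh
# Hypothetical helper, NOT taken from the attached pgsql agent.
# Returns 0 if the postmaster owning the given data directory is alive.
pgsql_instance_running() {
    pgdata="$1"
    pidfile="$pgdata/postmaster.pid"

    # No PID file: this instance is not running (or was shut down cleanly).
    [ -f "$pidfile" ] || return 1

    # The first line of postmaster.pid holds the postmaster's process ID.
    pid=$(head -n 1 "$pidfile")
    case "$pid" in
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac

    # kill -0 sends no signal; it only tests that the process exists and
    # that we are allowed to signal it (resource agents normally run as root).
    kill -0 "$pid" 2>/dev/null
}
```

Because the check starts from one data directory, a second stand-alone
PostgreSQL instance on the same node is not mistaken for the managed one,
which a bare search for any postmaster process cannot guarantee.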

On Thu, May 1, 2008 at 7:50 AM, Doug Knight <[EMAIL PROTECTED]> wrote:
> Serge,
>  I installed the following RPMs:
>
>  libnet-1.1.2.1-1.1.x86_64.rpm
>  heartbeat-resources-2.1.3-21.1.x86_64.rpm
>  heartbeat-common-2.1.3-21.1.x86_64.rpm
>  heartbeat-2.1.3-21.1.x86_64.rpm
>  pacemaker-heartbeat-0.6.3-6.1.x86_64.rpm
>  pacemaker-pygui-1.3-6.1.x86_64.rpm
>
>  On the system I was upgrading, yes, I had a stand-alone (non-HB managed)
>  8.3.1 instance running and, as you suspected, it confused pgsql. The
>  version that came with 2.1.3 only checks for a postmaster process; it
>  does not use the environment variables to check for the specific
>  instance. Reverting to the version I had been using before the upgrade
>  worked, as it checks for the postgres root and the associated
>  postmaster.pid file, THEN actually looks for the process ID found there.
>
>  Doug
>
>
>
>  On Thu, 2008-05-01 at 06:39 -0600, Serge Dubrouski wrote:
>
>  > Pick up a new pgsql from the Pacemaker repository. Though 2.1.3 is the
>  > latest available build of Heartbeat in the old packaging, it's known
>  > to have some major issues, so I'd recommend switching to Pacemaker
>  > (Heartbeat version). Some time ago Lars and Andrew announced that
>  > they'd release 2.1.4 in the old packaging, but so far it hasn't
>  > happened.
>  >
>  > For my reference: do you have several instances of PostgreSQL running
>  > on the same node? pgsql from 2.1.3 shouldn't have any problems unless
>  > you do.
>  >
>  > On Thu, May 1, 2008 at 6:10 AM, Doug Knight <[EMAIL PROTECTED]> wrote:
>  > > Serge,
>  > >  I kept the one I had been using from the previous install, which seems
>  > >  to work OK. I thought 2.1.3 was the latest build? Which build do you
>  > >  recommend I grab pgsql from? Using the old pgsql I was able to bring up
>  > >  heartbeat on the secondary. Today I upgrade the primary, so it would be
>  > >  nice for someone to suggest how to prevent the emergency reboot from
>  > >  happening. I plan on "unmanaging" the resources during the upgrade, as
>  > >  opposed to taking them down.
>  > >
>  > >  Thanks,
>  > >  Doug
>  > >
>  > >
>  > >
>  > >  On Wed, 2008-04-30 at 13:40 -0600, Serge Dubrouski wrote:
>  > >
>  > >  > Doug -
>  > >  >
>  > >  > Regarding pgsql: the version of that OCF script that comes with
>  > >  > 2.1.3 isn't "the best" one. I'd strongly recommend getting a newer
>  > >  > one from the later builds.
>  > >  >
>  > >  > On Wed, Apr 30, 2008 at 1:32 PM, Doug Knight <[EMAIL PROTECTED]> wrote:
>  > >  > > Hi,
>  > >  > >  I am performing a rolling upgrade on a RHEL5 system. Old HA was 2.0.8,
>  > >  > >  upgrading to 2.1.3. The primary is on 2.0.8 and up; the secondary was
>  > >  > >  the one being upgraded. During startup I encountered some issues
>  > >  > >  with the OCF scripts for our applications, which I have now corrected
>  > >  > >  (mainly the relocation of ocf-shellfuncs, etc). The upgraded node did
>  > >  > >  come up and connect to the primary server (though it decided to try
>  > >  > >  restarting postgres locally when it wasn't supposed to; more in a
>  > >  > >  later email, maybe). There are two things that concern me. First, I
>  > >  > >  saw a warning as follows:
>  > >  > >
>  > >  > >  WARN: crm_peer_init: Set these options via openais.conf
>  > >  > >
>  > >  > >  I did not install AIS, I stayed with the heartbeat-only stack
>  > >  > >  (heartbeat, common, resource, heartbeat-pacemaker, etc). Should I be
>  > >  > >  concerned about this warning, and if so what should I do about it?
>  > >  > >
>  > >  > >  Second, once I let the systems settle out and the logs got quiet, I
>  > >  > >  checked status on my resources. As noted previously, pgsql had
>  > >  > >  problems. I attempted to clean pgsql (crm_resource -C -r pgsql_5432,
>  > >  > >  which stated I needed to use -H, which I did), and I got an emergency
>  > >  > >  condition in heartbeat and it rebooted my server! So aside from the
>  > >  > >  pgsql issue, how can I prevent heartbeat from doing a reboot? There
>  > >  > >  are other things running on this server which a reboot plays havoc
>  > >  > >  with, so I would like to avoid a repeat if possible.
>  > >  > >
>  > >  > >  Thanks,
>  > >  > >  Doug Knight
>  > >  > >  WSI Corp
>  > >  > >  p.s. below is the log from the point that pgsql was frozen to the
>  > >  > >  reboot.
>  > >  > >
>  > >  > >  lrmd[5476]: 2008/04/30_13:54:03 notice: on_msg_perform_op: resource pgsql_5432 is frozen, no ops can run.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: do_lrm_rsc_op: Operation monitor on pgsql_5432 failed: -1
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 WARN: do_log: [[FSA]] Input I_FAIL from do_lrm_rsc_op() received in state (S_NOT_DC)
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_state_transition: State transition S_NOT_DC -> S_RECOVERY [ input=I_FAIL cause=C_FSA_INTERNAL origin=do_lrm_rsc_op ]
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: do_recover: Action A_RECOVER (0000000001000000) not supported
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: do_log: [[FSA]] Input I_TERMINATE from do_recover() received in state (S_RECOVERY)
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_state_transition: State transition S_RECOVERY -> S_TERMINATE [ input=I_TERMINATE cause=C_FSA_INTERNAL origin=do_recover ]
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_shutdown: All subsystems stopped, continuing
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: verify_stopped: Resource get_vortex_rpm_conus_4km_ingestor_HA was active at shutdown.  You may ignore this error if it is unmanaged.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: verify_stopped: Resource get_vortex_rpm_ingestor_HA was active at shutdown.  You may ignore this error if it is unmanaged.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: verify_stopped: Resource pgsql_5432 was active at shutdown.  You may ignore this error if it is unmanaged.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: verify_stopped: Resource get_sat_hd_vsir_ingestor_HA was active at shutdown.  You may ignore this error if it is unmanaged.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: verify_stopped: Resource get_vortex_etagfs_ingestor_HA was active at shutdown.  You may ignore this error if it is unmanaged.
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_lrm_control: Disconnected from the LRM
>  > >  > >  ccm[5474]: 2008/04/30_13:54:03 info: client (pid=5479) removed from ccm
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_ha_control: Disconnected from Heartbeat
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_cib_control: Disconnecting CIB
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: crmd_cib_connection_destroy: Connection to the CIB terminated...
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_exit: Performing A_EXIT_0 - gracefully exiting the CRMd
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 ERROR: do_exit: Could not recover from internal error
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: free_mem: Dropping I_TERMINATE: [ state=S_TERMINATE cause=C_FSA_INTERNAL origin=do_stop ]
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: destroy_crm_node: Destroying entry for node 1
>  > >  > >  cib[5475]: 2008/04/30_13:54:03 WARN: send_via_callback_channel: Client 4e142eb9-e202-4a30-98f0-de2091d78976 has disconnected
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: destroy_crm_node: Destroying entry for node 0
>  > >  > >  cib[5475]: 2008/04/30_13:54:03 WARN: do_local_notify: A-Sync reply to 5479 failed: client left before we could send reply
>  > >  > >  crmd[5479]: 2008/04/30_13:54:03 info: do_exit: [crmd] stopped (2)
>  > >  > >  heartbeat[5455]: 2008/04/30_13:54:03 WARN: Managed /usr/lib64/heartbeat/crmd process 5479 exited with return code 2.
>  > >  > >  heartbeat[5455]: 2008/04/30_13:54:03 EMERG: Rebooting system. Reason: /usr/lib64/heartbeat/crmd
>  > >  > >
>  > >  > >  _______________________________________________
>  > >  > >  Linux-HA mailing list
>  > >  > >  [email protected]
>  > >  > >  http://lists.linux-ha.org/mailman/listinfo/linux-ha
>  > >  > >  See also: http://linux-ha.org/ReportingProblems
>  > >  > >
>  > >  >
>  > >  >
>  > >  >
>  > >
>  >
>  >
>  >
>



-- 
Serge Dubrouski.

pgsql
Description: Binary data

