Hi,

On Wed, Mar 25, 2009 at 05:58:52PM +0100, racciari wrote:
> Hi,
> 
> we are using heartbeat in a 6+1 cluster on Linux. It worked fine
> until we updated our configuration from 60 resources (8 per
> server + 12 on backup) to 132 resources (20 per server + 12 on
> backup). The new cib.xml is nearly 2000 lines long. Updating the cib
> using cibadmin to inject new resources is ok, and the resources start
> correctly.

Impressive.

> We encountered problems when restarting a node of the cluster asynchronously
> (i.e. not at the same time as the others): the node joining the cluster
> tries to update its cib but it always fails with these messages:
> [MS Word font and style definitions inserted by the mail client, omitted]

Friendly mailer ;-)

> cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Diff 0.112.13 -> 0.112.14 not applied to 0.111.1: current "epoch" is less than required
> cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Requesting re-sync from peer: current "epoch" is less than required
> cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
> cib[6201]: 2009/03/19_01:34:16 WARN: cib_process_diff: Not applying diff 0.112.14 -> 0.112.15 (sync in progress)
> cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh

cib (the program) first tries to apply the diff and, if that fails,
it should fetch the whole CIB instead.
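
To make the log messages above easier to read, here is a rough sketch of
the "apply the diff, else re-sync" decision. It is not the actual cib
code: the class and function names are made up for illustration, and it
assumes the three-part version string in the logs is
admin_epoch.epoch.num_updates and that a diff only applies to a CIB
whose version equals the diff's source version.

```python
from typing import NamedTuple, Optional, Tuple


class CibVersion(NamedTuple):
    """Three-part CIB version as printed in the logs (assumed field order)."""
    admin_epoch: int
    epoch: int
    num_updates: int

    @classmethod
    def parse(cls, text: str) -> "CibVersion":
        admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
        return cls(admin_epoch, epoch, num_updates)


def diff_rejection_reason(current: CibVersion, diff_from: CibVersion) -> Optional[str]:
    """Return None if the diff can be applied, otherwise the reason it cannot."""
    # Compare the fields from most to least significant and stop at the
    # first mismatch, which is what the "epoch is less than required"
    # wording in the log suggests.
    for field in CibVersion._fields:
        have, need = getattr(current, field), getattr(diff_from, field)
        if have < need:
            return f'current "{field}" is less than required'
        if have > need:
            return f'current "{field}" is greater than required'
    return None


def handle_diff(current: CibVersion, diff_from: CibVersion,
                diff_to: CibVersion) -> Tuple[CibVersion, str]:
    """Apply the diff if the versions line up, else ask for a full re-sync."""
    reason = diff_rejection_reason(current, diff_from)
    if reason is None:
        return diff_to, "applied"
    # Hypothetical fallback: the local copy stays as it is until a peer
    # sends the complete CIB.
    return current, f"re-sync requested: {reason}"


if __name__ == "__main__":
    # Versions taken from the quoted log lines.
    local = CibVersion.parse("0.111.1")
    print(handle_diff(local, CibVersion.parse("0.112.13"), CibVersion.parse("0.112.14")))
```

With the versions from the quoted log, the joining node is at 0.111.1
while the diff starts from 0.112.13, so the sketch takes the re-sync
branch. The open question in this thread is why that re-sync never
seems to complete.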

> From the other nodes, the joining node appears as "standby" and the
> logs keep growing with the messages above. The new node sees the
> cluster as offline and keeps trying to update the cib. The major
> problem seems to be:
> 
> heartbeat[6192]: 2009/03/19_01:43:08 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8941458)

This is not a major problem.

> "Manual" updating through cibadmin did not give any result.
> 
> The heartbeat process seems to loop on this problem and the
> cluster becomes unstable so that we have to restart the whole
> cluster. When restarting all the servers with same
> configuration at the same time, everything works fine.
> 
> The servers are not loaded, network traffic is low, I can't
> explain the problem and I can reproduce it on other servers
> with similar configuration.

Other nodes in the same cluster or in another cluster?

> When we tried increasing the hardcoded dispatch delay, we got this new message:
> ERROR: Irretrievably lost packet: node dsfrg0111 seq XXXX
>
> Does anyone have a similar cluster configuration with many resources that works?
> How can we resolve this problem?

Not much anyone can do without more information, in particular
more logs. Which version do you run? Does it support hb_report?
If so, please reproduce and file a bugzilla and attach a
hb_report generated report.

Thanks,

Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
