Hi,
On Wed, Mar 25, 2009 at 05:58:52PM +0100, racciari wrote:
> Hi,
>
> we are using heartbeat in a 6+1 cluster on Linux. It worked fine
> until we updated our configuration from 60 resources (8 per
> server + 12 on backup) to 132 resources (20 per server + 12 on
> backup). The new cib.xml is nearly 2000 lines. Updating the cib
> using cibadmin to inject new resources is ok, and the resources
> start correctly.
Impressive.
> We encountered problems when restarting a node of the cluster asynchronously
> (i.e. not at the same time as the others): the node joining the cluster
> tries to update its cib but always fails with these messages:
> cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Diff 0.112.13 ->
> 0.112.14 not applied to 0.111.1: current "epoch" is less than required
> cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Requesting re-sync
> from peer: current "epoch" is less than required
> cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff
> > FAILED: Application of an update diff failed, requesting a full refresh
> cib[6201]: 2009/03/19_01:34:16 WARN: cib_process_diff: Not applying diff
> 0.112.14 -> 0.112.15 (sync in progress)
> cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff
> > FAILED: Application of an update diff failed, requesting a full refresh
cib (the program) first tries to apply the diff and, if that
fails, it should transfer the whole CIB instead.
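
To illustrate what the log lines above seem to be saying, here is a
rough sketch in Python. It is not the actual cib code. The function
names are made up for this example, and reading the three-part version
as (admin_epoch, epoch, num_updates) is an assumption based on the
"epoch" wording in the log.

```python
from typing import Tuple

# Assumed layout of a version string such as "0.112.13":
# (admin_epoch, epoch, num_updates). Names are illustrative only.
Version = Tuple[int, int, int]


def parse_version(text: str) -> Version:
    """Turn "0.112.13" into (0, 112, 13)."""
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return (admin_epoch, epoch, num_updates)


def decide(current: str, diff_from: str) -> str:
    """Decide what a node should do with an incoming diff.

    current   -- version of the local CIB copy
    diff_from -- version the diff expects to be applied to
    """
    cur = parse_version(current)
    src = parse_version(diff_from)
    if cur == src:
        return "apply"    # the diff matches the local copy
    if cur < src:
        return "resync"   # local copy is too old, ask a peer for the full CIB
    return "ignore"       # local copy is already newer than the diff


# The case from the log: diff 0.112.13 -> 0.112.14 against a local 0.111.1
print(decide("0.111.1", "0.112.13"))
```

With the values from the log the sketch ends up in the "resync" branch,
which matches the "Requesting re-sync from peer" message. What it does
not explain is why the re-sync never completes on the joining node.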
> From the other nodes, the joining node appears as "standby" and the
> logs keep growing with the messages above. The new node sees the
> cluster as offline and keeps trying to update the cib. The major
> problem seems to be:
>
> heartbeat[6192]: 2009/03/19_01:43:08 WARN: Gmain_timeout_dispatch: Dispatch
> function for retransmit request took too long to execute: 20 ms (> 10 ms)
> (GSource: 0x8941458)
This is not a major problem.
> "Manual" updating through cibadmin did not give any result.
>
> The heartbeat process seems to loop on this problem and the
> cluster becomes unstable, so we have to restart the whole
> cluster. When restarting all the servers with the same
> configuration at the same time, everything works fine.
>
> The servers are not loaded, network traffic is low, I can't
> explain the problem and I can reproduce it on other servers
> with similar configuration.
Other nodes in the same cluster or in another cluster?
> When we tried increasing the hardcoded dispatch delay, we got this new message:
> ERROR: Irretrievably lost packet: node dsfrg0111 seq XXXX
>
> Does anyone have a similar cluster configuration with many resources that works?
> How can we resolve this problem?
Not much anyone can do without more information, in particular
more logs. Which version do you run? Does it support hb_report?
If so, please reproduce the problem, file a bug in bugzilla, and
attach the report generated by hb_report.
Thanks,
Dejan
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems