Hi,
we are using heartbeat in a 6+1 cluster on Linux. It worked fine until we
updated our configuration from 60 resources (8 per server + 12 on the backup)
to 132 resources (20 per server + 12 on the backup). The new cib.xml is nearly
2000 lines long.
Updating the CIB with cibadmin to inject the new resources works, and the
resources start correctly.
We run into problems when restarting a node of the cluster asynchronously
(i.e. not at the same time as the others): the node joining the cluster
tries to update its CIB, but it always fails with these messages:
cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Diff 0.112.13 ->
0.112.14 not applied to 0.111.1: current "epoch" is less than required
cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Requesting re-sync
from peer: current "epoch" is less than required
cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff >
FAILED: Application of an update diff failed, requesting a full refresh
cib[6201]: 2009/03/19_01:34:16 WARN: cib_process_diff: Not applying diff
0.112.14 -> 0.112.15 (sync in progress)
cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff >
FAILED: Application of an update diff failed, requesting a full refresh
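To make the first log message easier to read, here is a minimal Python sketch that pulls the three CIB versions out of a cib_process_diff line and reports which field is behind. It assumes the version triplet is admin_epoch.epoch.num_updates and that a diff only applies when the local version matches the diff's starting version. The names explain_diff, FIELDS and DIFF_RE are hypothetical helpers for this illustration and are not part of heartbeat.

```python
import re

# Assumed field order of a CIB version string such as "0.112.13".
FIELDS = ("admin_epoch", "epoch", "num_updates")

VERSION = r"(\d+)\.(\d+)\.(\d+)"
DIFF_RE = re.compile(rf"Diff {VERSION} -> {VERSION} not applied to {VERSION}")


def explain_diff(line):
    """Return the versions named in a rejected-diff log line, or None."""
    # The mail client wrapped the log lines, so collapse all whitespace first.
    text = " ".join(line.split())
    match = DIFF_RE.search(text)
    if match is None:
        return None
    nums = [int(g) for g in match.groups()]
    source, target, local = tuple(nums[0:3]), tuple(nums[3:6]), tuple(nums[6:9])

    # Find the first field where the local CIB differs from the diff's
    # starting version; it is "lagging" only if the local value is smaller.
    lagging = None
    for name, have, need in zip(FIELDS, local, source):
        if have != need:
            lagging = name if have < need else None
            break
    return {"source": source, "target": target, "local": local,
            "lagging_field": lagging}


if __name__ == "__main__":
    sample = ('cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: '
              'Diff 0.112.13 ->\n0.112.14 not applied to 0.111.1: '
              'current "epoch" is less than required')
    print(explain_diff(sample))
```

For the line above, the sketch reports epoch as the lagging field (111 on the joining node versus 112 in the diff), which agrees with the reason given in the log.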
From the other nodes, the joining node appears as "standby", and the logs keep
growing with the messages above. The new node sees the cluster as offline and
keeps trying to update its CIB. The main problem seems to be:
heartbeat[6192]: 2009/03/19_01:43:08 WARN: Gmain_timeout_dispatch: Dispatch
function for retransmit request took too long to execute: 20 ms (> 10 ms)
(GSource: 0x8941458)
"Manual" updating through cibadmin did not give any result.
The heartbeat process seems to loop on this problem and the cluster becomes
unstable, so we have to restart the whole cluster. When all the servers are
restarted at the same time with the same configuration, everything works fine.
The servers are not loaded and network traffic is low. I can't explain the
problem, but I can reproduce it on other servers with a similar configuration.
When we tried increasing the hardcoded dispatch delay, we got this new message
instead:
ERROR: Irretrievably lost packet: node dsfrg0111 seq XXXX
Does anyone have a similar cluster configuration with many resources that
works? How can we resolve this problem?
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems