[Linux-HA] Dispatch function for retransmit request took too long to execute

racciari Thu, 26 Mar 2009 07:09:11 -0700

Hi,

we are using heartbeat in a 6+1 cluster on Linux. It worked fine until we 
updated our configuration from 60 resources (8 per server + 12 on the backup) 
to 132 resources (20 per server + 12 on the backup). The new cib.xml is nearly 
2000 lines.
Updating the CIB with cibadmin to inject the new resources works, and the 
resources start correctly.


We run into problems when restarting a single node of the cluster on its own 
(i.e. not at the same time as the others): the node joining the cluster tries 
to update its CIB, but this always fails with these messages:

cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Diff 0.112.13 -> 0.112.14 not applied to 0.111.1: current "epoch" is less than required
cib[6201]: 2009/03/19_01:34:16 info: cib_process_diff: Requesting re-sync from peer: current "epoch" is less than required
cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
cib[6201]: 2009/03/19_01:34:16 WARN: cib_process_diff: Not applying diff 0.112.14 -> 0.112.15 (sync in progress)
cib[6201]: 2009/03/19_01:34:16 WARN: do_cib_notify: cib_apply_diff of <diff > FAILED: Application of an update diff failed, requesting a full refresh
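
One way to read the first message is sketched below. It is only an illustration, not the actual cib code: it assumes the three dot-separated numbers are admin_epoch.epoch.num_updates and that a diff is applied only when the local CIB version equals the version the diff starts from. The names CibVersion, parse_version and diff_rejection_reason are hypothetical.

```python
from typing import NamedTuple, Optional


class CibVersion(NamedTuple):
    # Assumed meaning of the three numbers in "0.111.1".
    admin_epoch: int
    epoch: int
    num_updates: int


def parse_version(text: str) -> CibVersion:
    """Turn a string such as "0.112.13" into a CibVersion."""
    admin_epoch, epoch, num_updates = (int(part) for part in text.split("."))
    return CibVersion(admin_epoch, epoch, num_updates)


def diff_rejection_reason(local: CibVersion, diff_source: CibVersion) -> Optional[str]:
    """Return why a diff cannot be applied, or None if the versions match.

    Fields are compared from most to least significant; the first
    mismatch decides the reason (hypothetical model of the check).
    """
    for field in CibVersion._fields:
        have, need = getattr(local, field), getattr(diff_source, field)
        if have < need:
            return f'current "{field}" is less than required'
        if have > need:
            return f'current "{field}" is greater than required'
    return None


# Values taken from the log line above: local CIB 0.111.1, diff starts at 0.112.13.
print(diff_rejection_reason(parse_version("0.111.1"), parse_version("0.112.13")))
```

Under these assumptions the joining node is one epoch behind (111 versus 112), so incremental diffs cannot be applied and it has to wait for a full re-sync from a peer.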


From the other nodes, the joining node appears as "standby", and their logs 
keep growing with the messages above. The new node sees the cluster as offline 
and keeps trying to update its CIB. The main problem seems to be:


heartbeat[6192]: 2009/03/19_01:43:08 WARN: Gmain_timeout_dispatch: Dispatch function for retransmit request took too long to execute: 20 ms (> 10 ms) (GSource: 0x8941458)
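
To see how often this warning fires and by how much the limit is exceeded, the log can be scanned with a small script. This is a rough sketch that assumes each warning sits on a single line in the log file and has the same wording as the line above; the function name dispatch_overruns is made up for the example.

```python
import re

# Matches the warning text quoted above and captures what was dispatched,
# how long it took, and the limit it was compared against.
DISPATCH_RE = re.compile(
    r"Dispatch function for (?P<what>.+?) took too long to execute: "
    r"(?P<took>\d+) ms \(> (?P<limit>\d+) ms\)"
)


def dispatch_overruns(lines):
    """Yield (what, took_ms, limit_ms) for every overrun warning found."""
    for line in lines:
        match = DISPATCH_RE.search(line)
        if match:
            yield match["what"], int(match["took"]), int(match["limit"])


sample = [
    "heartbeat[6192]: 2009/03/19_01:43:08 WARN: Gmain_timeout_dispatch: "
    "Dispatch function for retransmit request took too long to execute: "
    "20 ms (> 10 ms) (GSource: 0x8941458)",
]

for what, took_ms, limit_ms in dispatch_overruns(sample):
    print(f"{what}: {took_ms} ms (limit {limit_ms} ms)")
```

To run it against a real log, pass an open file object to dispatch_overruns instead of the sample list.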
   
 
"Manual" updating through cibadmin did not give any result.

The heartbeat process seems to loop on this problem, and the cluster becomes 
so unstable that we have to restart the whole cluster. When all the servers 
are restarted at the same time with the same configuration, everything works 
fine.

The servers are not loaded and network traffic is low. I can't explain the 
problem, and I can reproduce it on other servers with a similar configuration.




When we tried increasing the hardcoded dispatch delay, we got this new message instead:
ERROR: Irretrievably lost packet: node dsfrg0111 seq XXXX
 
Does anyone have a similar cluster configuration with many resources that works?
How can we resolve this problem?

  


_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems
