Hi,

On Thu, Oct 15, 2009 at 07:02:55AM -0700, Robinson, Eric wrote:
>  
> > Backups are known to be quite demanding. I'd suggest 
> > collecting data with sar.
> 
> Done.
> 
> The backup starts at 1:45 AM and finishes at 4:07 AM. 
> 
> I got the warning again 4 times last night between 1:50 AM and 4:00 AM.
> So your theory may be correct. However... 
> 
> The longest delay was at 1:50:29 AM:
> 
> heartbeat[9953]: 2009/10/15_01:50:29 WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 10040 ms (> 1010 ms)
> before being called (GSource: 0x9c282c8)
> 
> Here is the sar output for that same time. Note that the load is not
> especially high...
> 
>                   CPU     %user     %nice   %system   %iowait
> %steal     %idle
> 01:50:15 AM     all      0.75      0.00      3.00      2.62      0.00
> 93.62
> 01:50:16 AM     all      0.75      0.00      3.00      2.62      0.00
> 93.62
> 01:50:17 AM     all      0.87      0.00      6.98      5.61      0.00
> 86.53
> 01:50:18 AM     all      0.75      0.00      7.63     15.27      0.00
> 76.35
> 01:50:19 AM     all      1.00      0.00      3.49     11.10      0.00
> 84.41
> 01:50:20 AM     all      0.88      0.00      3.38     10.50      0.00
> 85.25
> 01:50:21 AM     all      0.75      0.00      3.50     10.61      0.00
> 85.14
> 01:50:22 AM     all      1.25      0.00      3.00     10.74      0.00
> 85.02
> 01:50:23 AM     all      1.25      0.00      3.00     11.11      0.00
> 84.64
> 01:50:24 AM     all      1.37      0.00      3.00     10.99      0.00
> 84.64
> 01:50:25 AM     all      1.25      0.00      3.25     10.62      0.00
> 84.88
> 01:50:26 AM     all      1.25      0.00      3.25     12.38      0.00
> 83.12
> 01:50:27 AM     all      1.37      0.00      3.37     12.84      0.00
> 82.42
> 01:50:28 AM     all      1.25      0.00      3.38     10.62      0.00
> 84.75
> 01:50:29 AM     all      1.25      0.00      3.62     12.25      0.00
> 82.88
> 01:50:30 AM     all      1.37      0.00      3.12     15.71      0.00
> 79.80
> 01:50:31 AM     all      1.38      0.00      3.25     12.38      0.00
> 83.00
> 01:50:32 AM     all      1.25      0.00      3.37     10.61      0.00
> 84.77
> 01:50:33 AM     all      1.12      0.00      2.87      7.87      0.00
> 88.14
> 01:50:34 AM     all      0.25      0.00      0.62      0.50      0.00
> 98.62
> 01:50:35 AM     all      0.62      0.00      1.25      2.25      0.00
> 95.88
> 01:50:36 AM     all      0.75      0.00      1.62      1.87      0.00
> 95.76
> 01:50:37 AM     all      0.62      0.00      1.50      1.88      0.00
> 96.00
> 01:50:38 AM     all      0.75      0.00      1.75      2.38      0.00
> 95.12
> 01:50:39 AM     all      0.75      0.00      1.75      2.75      0.00
> 94.76
> 01:50:40 AM     all      0.50      0.00      1.75      5.88      0.00
> 91.86
> 
> The next longest delay was at 2:00:22 AM:
> 
> heartbeat[9953]: 2009/10/15_02:00:22 WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 4900 ms (> 1010 ms)
> before being called (GSource: 0x9c282c8)
> 
> During this period, the load was quite high...
> 
>                   CPU     %user     %nice   %system   %iowait
> %steal     %idle
> 02:00:07 AM     all      0.00      0.00      0.00     25.03      0.00
> 74.97
> 02:00:08 AM     all      0.75      0.00      2.37     34.58      0.00
> 62.30
> 02:00:09 AM     all      1.00      0.00      2.37     32.21      0.00
> 64.42
> 02:00:10 AM     all      6.26      0.00      9.26     45.93      0.00
> 38.55
> 02:00:11 AM     all      7.35      0.00     10.59     59.90      0.00
> 22.17
> 02:00:12 AM     all     25.09      0.00     13.86     53.68      0.00
> 7.37
> 02:00:13 AM     all     23.12      0.00     13.75     54.12      0.00
> 9.00
> 02:00:14 AM     all     26.22      0.00     26.84     41.20      0.00
> 5.74
> 02:00:15 AM     all     17.60      0.00     24.84     46.44      0.00
> 11.11
> 02:00:16 AM     all     17.08      0.00     25.19     47.88      0.00
> 9.85
> 02:00:17 AM     all     15.11      0.00     30.46     41.45      0.00
> 12.98
> 02:00:18 AM     all     23.88      0.00     25.75     42.00      0.00
> 8.38
> 02:00:19 AM     all     15.61      0.00     21.72     53.93      0.00
> 8.74
> 02:00:20 AM     all      9.11      0.00     20.35     62.17      0.00
> 8.36
> 02:00:21 AM     all      7.88      0.00     17.65     64.96      0.00
> 9.51
> 02:00:22 AM     all      6.49      0.00      9.24     60.55      0.00
> 23.72
> 02:00:23 AM     all     10.49      0.00      5.99     65.67      0.00
> 17.85
> 02:00:24 AM     all     30.80      0.00     21.45     35.41      0.00
> 12.34
> 02:00:25 AM     all     21.88      0.00     25.50     46.62      0.00
> 6.00
> 02:00:26 AM     all     27.88      0.00     25.50     40.50      0.00
> 6.12
> 02:00:27 AM     all     17.12      0.00     26.62     48.62      0.00
> 7.62
> 02:00:28 AM     all     20.05      0.00     25.78     50.56      0.00
> 3.61
> 02:00:29 AM     all     20.75      0.00     23.12     48.75      0.00
> 7.38
> 02:00:30 AM     all     13.86      0.00     22.10     61.17      0.00
> 2.87
> 02:00:31 AM     all     14.98      0.00     22.10     55.43      0.00
> 7.49
> 02:00:32 AM     all     18.93      0.00     21.67     52.30      0.00
> 7.10
> 02:00:33 AM     all     29.82      0.00     24.31     37.72      0.00
> 8.15
> 02:00:34 AM     all     34.83      0.00     30.34     32.08      0.00
> 2.75
> 02:00:35 AM     all     37.58      0.00     29.96     28.46      0.00
> 4.00
> 02:00:36 AM     all     30.46      0.00     25.47     39.95      0.00
> 4.12
> 02:00:37 AM     all     24.72      0.00     26.97     39.45      0.00
> 8.86
> 02:00:38 AM     all     23.53      0.00     26.41     45.43      0.00
> 4.63
> 02:00:39 AM     all     26.22      0.00     23.72     41.57      0.00
> 8.49
> 
> 
> The next longest delay was at 4:00:23 AM:
> 
> heartbeat[9953]: 2009/10/15_04:00:23 WARN: Gmain_timeout_dispatch:
> Dispatch function for send local status was delayed 3160 ms (> 1010 ms)
> before being called (GSource: 0x9c282c8)
> 
> Unfortunately, my sar script stopped capturing at 3:00, so I don't know
> if the load was also high during this period. What I *do* know is that
> the longest delay warning occurred when the system load was relatively
> low.
> 
> I would appreciate your follow-up thoughts. What would cause heartbeat
> to be delayed in sending the local status when the load was low? 

That's a kernel/performance issue. Perhaps there's an issue with
the storage driver. The fact that the load seems to be low is
probably because the metrics you're looking at are misleading.
For instance, the first delay occurs around the time the backup
client starts and it usually starts by scanning all filesystems
which probably means many small storage requests. The later load
is probably due to big I/O (file transfer) and it seems to bother
the kernel a bit less than the former one.

Thanks,

Dejan

> --
> Eric Robinson
> 
> 
> Disclaimer - October 15, 2009 
> This email and any files transmitted with it are confidential and intended 
> solely for General Linux-HA mailing list. If you are not the named addressee 
> you should not disseminate, distribute, copy or alter this email. Any views 
> or opinions presented in this email are solely those of the author and might 
> not represent those of Physician Select Management and Physician's Managed 
> Care. Warning: Although Physician Select Management and Physician's Managed 
> Care have taken reasonable precautions to ensure no viruses are present in 
> this email, the companies cannot accept responsibility for any loss or damage 
> arising from the use of this email or attachments. 
> This disclaimer was added by Policy Patrol: http://www.policypatrol.com/
> _______________________________________________
> Linux-HA mailing list
> [email protected]
> http://lists.linux-ha.org/mailman/listinfo/linux-ha
> See also: http://linux-ha.org/ReportingProblems
_______________________________________________
Linux-HA mailing list
[email protected]
http://lists.linux-ha.org/mailman/listinfo/linux-ha
See also: http://linux-ha.org/ReportingProblems

Reply via email to