On Tue, Feb 12, 2013 at 06:28:15PM +1100, Chris Dunlop wrote:
> Hi,
> 
> What are likely causes for "slow requests" and "monclient: hunting for new
> mon" messages? E.g.:
> 
> 2013-02-12 16:27:07.318943 7f9c0bc16700  0 monclient: hunting for new mon
> ...
> 2013-02-12 16:27:45.892314 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
> included below; oldest blocked for > 30.383883 secs
> 2013-02-12 16:27:45.892323 7f9c13c26700  0 log [WRN] : slow request 30.383883 
> seconds old, received at 2013-02-12 16:27:15.508374: 
> osd_op(client.9821.0:122242 rb.0.209f.74b0dc51.000000000120 [write 
> 921600~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892328 7f9c13c26700  0 log [WRN] : slow request 30.383782 
> seconds old, received at 2013-02-12 16:27:15.508475: 
> osd_op(client.9821.0:122243 rb.0.209f.74b0dc51.000000000120 [write 
> 987136~4096] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892334 7f9c13c26700  0 log [WRN] : slow request 30.383720 
> seconds old, received at 2013-02-12 16:27:15.508537: 
> osd_op(client.9821.0:122244 rb.0.209f.74b0dc51.000000000120 [write 
> 1036288~8192] 2.981cf6bc) v4 currently no flag points reached
> 2013-02-12 16:27:45.892338 7f9c13c26700  0 log [WRN] : slow request 30.383684 
> seconds old, received at 2013-02-12 16:27:15.508573: 
> osd_op(client.9821.0:122245 rb.0.209f.74b0dc51.000000000122 [write 
> 1454080~4096] 2.fff29a9a) v4 currently no flag points reached
> 2013-02-12 16:27:45.892341 7f9c13c26700  0 log [WRN] : slow request 30.328986 
> seconds old, received at 2013-02-12 16:27:15.563271: 
> osd_op(client.9821.0:122246 rb.0.209f.74b0dc51.000000000122 [write 
> 1482752~4096] 2.fff29a9a) v4 currently no flag points reached

Sorry, I forgot to add what lead me to look into this in the first place...

The b2 machine is running a number of libvirt / kvm virtuals on rbd via
librados:

    <disk type='network' device='disk'>
      <driver name='qemu' type='raw'/>
      <auth username='xxxxxx'>
        <secret type='ceph' uuid='xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx'/>
      </auth>
      <source protocol='rbd' name='rbd/foo:rbd_cache=1'/>
      <target dev='vdb' bus='virtio'/>
      <alias name='virtio-disk1'/>
      <address type='pci' domain='0x0000' bus='0x00' slot='0x06' 
function='0x0'/>
    </disk>

Around the time the "slow requests" messages pop up, the virtual machines 
experience
a spike in disk latency, e.g. from inside one of the virtuals:

                    load %user %nice  %sys  %iow  %stl %idle  dev rrqm/s wrqm/s 
   r/s    w/s    rkB/s    wkB/s arq-sz aqu-sz  await  rwait wwait %util
2013-02-12-18:55:23  1.7   0.3   0.0   0.9  42.2   0.0  18.5  vdb    0.0   25.8 
   0.0    9.6     0.00   136.53  28.44   1.34 139.56   0.00 139.6  70.7
2013-02-12-18:55:38  1.5   0.9   0.0   0.4  35.8   0.0  42.8  vdb    0.0   26.6 
   0.0   10.8     0.00   143.20  26.52   0.46  41.98   0.00  42.0  40.0
2013-02-12-18:55:53  1.4   0.7   0.0   0.4  38.0   0.0  51.7  vdb    0.0    7.5 
   0.0    6.8     0.00    53.33  15.69   0.46  68.08   0.00  68.1  38.8
2013-02-12-18:56:10  2.2   0.0   0.0   0.1   8.0   0.0  15.8  vdb    0.0    2.1 
   0.0    0.2     0.00     9.18  78.00   0.98 4164.00   0.00 4164.0  70.6
2013-02-12-18:56:32  3.7   0.0   0.0   0.1   0.0   0.0   0.0  vdb    0.0    0.1 
   0.0    0.8     0.00     3.27   8.00  24.12 14519.78   0.00 14519.8 100.0
2013-02-12-18:56:47  5.1   0.5   0.0   0.6   0.0   0.0   0.0  vdb    0.0   11.5 
   0.0    5.4     0.00    65.87  24.40   1.60 3620.15   0.00 3620.1  88.4
2013-02-12-18:57:03  5.2   0.6   0.0   1.1   3.5   0.0   0.0  vdb    0.0   19.6 
   0.0    6.9     0.00   101.25  29.19   1.13 162.41   0.00 162.4  87.4
2013-02-12-18:57:20  5.2   0.6   0.0   0.9   9.7   0.0   0.0  vdb    0.0   44.6 
   0.0    9.7     0.00   214.82  44.27   2.41 248.22   0.00 248.2  85.3
2013-02-12-18:57:36  4.4   0.4   0.0   0.5  28.7   0.0  46.5  vdb    0.0   17.6 
   0.0    5.2     0.00    87.50  33.33   0.56 107.14   0.00 107.1  48.5

...corresponding to this in the b5 / osd.1 log:

2013-02-12 18:52:08.812880 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:55:18.851791 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 18:56:25.414948 7f9c13c26700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.372124 secs
2013-02-12 18:56:25.414958 7f9c13c26700  0 log [WRN] : slow request 30.372124 
seconds old, received at 2013-02-12 18:55:55.042767: 
osd_op(client.9821.0:144779 rb.0.209f.74b0dc51.000000000023 [write 
1593344~4096] 2.1882ddb7) v4 currently no flag points reached
...
2013-02-12 18:57:13.427008 7f9c13c26700  0 log [WRN] : slow request 40.721769 
seconds old, received at 2013-02-12 18:56:32.705190: 
osd_op(client.9821.0:146756 rb.0.209f.74b0dc51.000000000128 [write 819200~8192] 
2.b4390173) v4 currently commit sent
2013-02-12 18:59:43.886517 7f9c0bc16700  0 monclient: hunting for new mon
2013-02-12 19:02:53.911641 7f9c0bc16700  0 monclient: hunting for new mon


Cheers,

Chris.
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html

Reply via email to