I've done some more tracing. It looks like the high IO wait in VMs is
somewhat correlated with some OSDs having a high number of in-flight ops
(ceph admin socket, dump_ops_in_flight).
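
For anyone who wants to reproduce the check, here is a rough sketch of how the in-flight counts could be collected per OSD. It assumes the admin sockets live at the default /var/run/ceph/ceph-osd.N.asok path and that the dump_ops_in_flight output is JSON with a "num_ops" field and an "ops" list; the helper names (count_inflight, query_osd, busiest) and the threshold are made up for illustration and may need adjusting for your version.

```python
import json
import subprocess


def count_inflight(dump_json):
    """Return the number of in-flight ops from dump_ops_in_flight JSON text."""
    data = json.loads(dump_json)
    # Assumption: the dump carries "num_ops"; fall back to counting "ops".
    if "num_ops" in data:
        return int(data["num_ops"])
    return len(data.get("ops", []))


def query_osd(asok_path):
    """Ask one OSD admin socket for its in-flight ops and return the count."""
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "dump_ops_in_flight"]
    )
    return count_inflight(out.decode("utf-8"))


def busiest(counts, threshold):
    """Return (osd, count) pairs at or above threshold, highest count first."""
    hits = [(osd, n) for osd, n in counts.items() if n >= threshold]
    return sorted(hits, key=lambda pair: -pair[1])


if __name__ == "__main__":
    # Hypothetical OSD ids on this host; adjust to the OSDs you actually run.
    counts = {}
    for osd_id in (0, 1, 2):
        path = "/var/run/ceph/ceph-osd.%d.asok" % osd_id
        counts["osd.%d" % osd_id] = query_osd(path)
    for osd, n in busiest(counts, threshold=10):
        print(osd, n)
```

Running this from cron or a collectd exec plugin on each OSD host and graphing the counts next to the VM io wait makes the correlation easier to see.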

When in_flight_ops is high, I see something like this in the OSD log:
2014-06-12 19:57:24.572338 7f4db6bdf700  1 heartbeat_map reset_timeout
'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
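
To see how often this happens, a small sketch like the following could count these messages per minute. It assumes each message is a single line in the OSD log (it is only wrapped above by the mail client) in the format shown; the regex and the function name timeouts_per_minute are my own and not anything Ceph provides.

```python
import re
from collections import Counter

# Matches lines shaped like the heartbeat_map message quoted above.
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\.\d+\s+\S+\s+\d+\s+"
    r"heartbeat_map reset_timeout '(?P<thread>[^']+)' "
    r"had timed out after (?P<secs>\d+)"
)


def timeouts_per_minute(lines):
    """Count heartbeat_map timeout messages, keyed by 'YYYY-MM-DD HH:MM'."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if match:
            # Drop the seconds so the messages are bucketed per minute.
            counts[match.group("ts")[:16]] += 1
    return counts


if __name__ == "__main__":
    import sys

    # Usage: python timeouts.py /path/to/osd.log
    with open(sys.argv[1]) as log:
        for minute, n in sorted(timeouts_per_minute(log).items()):
            print(minute, n)
```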

Any ideas why this happens?

Thanks.
-Simon



On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <[email protected]>
wrote:

> On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
>
>> 1) I did check iostat on all OSDs, and iowait seems normal.
>> 2) ceph -w shows no correlation between high io wait and high iops.
>> Sometimes the reverse is true: when io wait is high (since it's a
>> cluster wide thing), the overall ceph iops drops too.
>>
>
> Not sure if you are doing it yet, but you may want to look at the
> statistics the OSDs can provide via the admin socket, especially
> outstanding operations and dump_historic_ops.  If you look at these for all
> of your OSDs you can start getting a feel for whether any specific OSDs are
> slow and if so, what the slow ops are getting hung up on.
>
>> 3) We have collectd running in VMs, and that's how we identified the
>> frequent high io wait. This happens for even lightly used VMs.
>>
>> Thanks.
>> -Simon
>>
>>
>> On Thu, Jun 12, 2014 at 9:26 AM, David <[email protected]
>> <mailto:[email protected]>> wrote:
>>
>>     Hi Simon,
>>
>>     Did you check iostat on the OSDs to check their utilization? What
>>     does your ceph -w say - perhaps you’re maxing out your cluster’s IOPS?
>>     Also, are you running any monitoring of your VMs’ iostats? We’ve
>>     often found some culprits overusing IOs.
>>
>>     Kind Regards,
>>     David Majchrzak
>>
>>     On 12 Jun 2014 at 15:22, Xu (Simon) Chen <[email protected]
>>     <mailto:[email protected]>> wrote:
>>
>>
>>      > Hi folks,
>>      >
>>      > We have two similar ceph deployments, but one of them is having
>>     trouble: VMs running with ceph-provided block devices are seeing
>>     frequent high io wait, every few minutes, usually 15-20%, but as
>>     high as 60-70%. This is cluster-wide and not correlated with VM's IO
>>     load. We turned on rbd cache and enabled writeback in qemu, but the
>>     problem persists. No-deepscrub doesn't help either.
>>      >
>>      > Without providing any one of our probably wrong theories, any
>>     ideas on how to troubleshoot?
>>      >
>>      > Thanks.
>>      > -Simon
>>      > _______________________________________________
>>      > ceph-users mailing list
>>      > [email protected] <mailto:[email protected]>
>>      > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
