I've done some more tracing. It looks like the high IO wait in VMs is somewhat correlated with some OSDs having a high number of in-flight ops (ceph admin socket, dump_ops_in_flight).
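For reference, here is a minimal sketch of how the in-flight op counts mentioned above could be polled per OSD. It assumes the default admin socket path (/var/run/ceph/ceph-osd.N.asok) and that the JSON output carries a "num_ops" counter next to the "ops" list; the function names count_in_flight and poll_osd are made up for illustration, and this is not necessarily how the numbers above were collected.

```python
import json
import subprocess


def count_in_flight(dump_json):
    """Return the number of in-flight ops from dump_ops_in_flight JSON text."""
    data = json.loads(dump_json)
    # Prefer the reported counter (assumed key "num_ops");
    # otherwise fall back to the length of the "ops" list.
    if "num_ops" in data:
        return int(data["num_ops"])
    return len(data.get("ops", []))


def poll_osd(osd_id, sock_dir="/var/run/ceph"):
    """Query one OSD's admin socket. The socket path is an assumption."""
    sock = "{}/ceph-osd.{}.asok".format(sock_dir, osd_id)
    out = subprocess.check_output(
        ["ceph", "--admin-daemon", sock, "dump_ops_in_flight"]
    )
    return count_in_flight(out.decode("utf-8"))
```

Running poll_osd() for every local OSD at the same interval as the collectd samples in the VMs would make it easier to line the two series up.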
When in_flight_ops is high, I see something like this in the OSD log:

2014-06-12 19:57:24.572338 7f4db6bdf700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15

Any ideas why this happens?

Thanks.
-Simon

On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <[email protected]> wrote:

> On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
>
>> 1) I did check iostat on all OSDs, and iowait seems normal.
>> 2) ceph -w shows no correlation between high io wait and high iops.
>> Sometimes the reverse is true: when io wait is high (since it's a
>> cluster wide thing), the overall ceph iops drops too.
>>
>
> Not sure if you are doing it yet, but you may want to look at the
> statistics the OSDs can provide via the admin socket, especially
> outstanding operations and dump_historic_ops. If you look at these for all
> of your OSDs you can start getting a feel for whether any specific OSDs are
> slow and if so, what slow ops are hanging up on.
>
>> 3) We have collectd running in VMs, and that's how we identified the
>> frequent high io wait. This happens for even lightly used VMs.
>>
>> Thanks.
>> -Simon
>>
>>
>> On Thu, Jun 12, 2014 at 9:26 AM, David <[email protected]> wrote:
>>
>>     Hi Simon,
>>
>>     Did you check iostat on the OSDs to check their utilization? What
>>     does your ceph -w say - perhaps you’re maxing your cluster’s IOPS?
>>     Also, are you running any monitoring of your VMs iostats? We’ve
>>     often found some culprits overusing IOs..
>>
>>     Kind Regards,
>>     David Majchrzak
>>
>>     12 jun 2014 kl. 15:22 skrev Xu (Simon) Chen <[email protected]>:
>>
>>      > Hi folks,
>>      >
>>      > We have two similar ceph deployments, but one of them is having
>>     trouble: VMs running with ceph-provided block devices are seeing
>>     frequent high io wait, every few minutes, usually 15-20%, but as
>>     high as 60-70%. This is cluster-wide and not correlated with VM's IO
>>     load. We turned on rbd cache and enabled writeback in qemu, but the
>>     problem persists. No-deepscrub doesn't help either.
>>      >
>>      > Without providing any one of our probably wrong theories, any
>>     ideas on how to troubleshoot?
>>      >
>>      > Thanks.
>>      > -Simon
>>      > _______________________________________________
>>      > ceph-users mailing list
>>      > [email protected]
>>      > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
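To see how often each OSD logs the heartbeat_map timeout quoted at the top of this message, the log lines could be counted per thread pool. This is a rough sketch only: the regex is derived from the single log line shown above and may not match other Ceph versions, and count_timeouts is a made-up helper name.

```python
import re
from collections import Counter

# Pattern derived from the one log line quoted in this thread (an assumption
# about the general format, not a documented log grammar).
TIMEOUT_RE = re.compile(
    r"heartbeat_map reset_timeout '(?P<pool>[^']+) thread (?P<thread>0x[0-9a-f]+)'"
    r" had timed out after (?P<secs>\d+)"
)


def count_timeouts(lines):
    """Count heartbeat_map timeout lines per thread pool name."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.search(line)
        if match:
            counts[match.group("pool")] += 1
    return counts
```

Feeding it each OSD's log file (for example, count_timeouts(open(path)) with path pointing at that OSD's log) and comparing the totals should show whether the timeouts cluster on a few OSDs.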
