Can you check and see if swap is being used on your OSD servers when
this happens, and even better, use something like collectl or another
tool to look for major page faults?
If you see anything like this, you may want to tweak swappiness to be
lower (say 10).
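A minimal sketch of that check, assuming a Linux OSD host where /proc/vmstat is readable. The helper names (parse_vmstat, delta, sample) and the 5-second interval are illustrative only and do not come from collectl or Ceph. It samples the kernel counters twice and reports how many major page faults and swap-ins/outs happened in between:

```python
import time

# Kernel counters of interest: major page faults, pages swapped in/out.
COUNTERS = ("pgmajfault", "pswpin", "pswpout")

def parse_vmstat(text):
    """Extract the counters above from the contents of /proc/vmstat."""
    values = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in COUNTERS:
            values[parts[0]] = int(parts[1])
    return values

def delta(before, after):
    """Per-counter difference between two parse_vmstat() results."""
    return {k: after.get(k, 0) - before.get(k, 0) for k in COUNTERS}

def read_file(path):
    with open(path) as f:
        return f.read()

def sample(interval=5.0):
    """Read /proc/vmstat twice, `interval` seconds apart (assumed default)."""
    before = parse_vmstat(read_file("/proc/vmstat"))
    time.sleep(interval)
    after = parse_vmstat(read_file("/proc/vmstat"))
    return delta(before, after)
```

Running sample() on an OSD host during an IO wait spike and seeing non-zero pswpin or a jump in pgmajfault would point toward swapping. The current setting can be read from /proc/sys/vm/swappiness and lowered with `sysctl -w vm.swappiness=10`.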
Mark
On 06/12/2014 03:17 PM, Xu (Simon) Chen wrote:
I've done some more tracing. It looks like the high IO wait in the VMs is
somewhat correlated with some OSDs having high in-flight ops (ceph admin
socket, dump_ops_in_flight).
When in_flight_ops is high, I see something like this in the OSD log:
2014-06-12 19:57:24.572338 7f4db6bdf700 1 heartbeat_map reset_timeout 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
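For reference, the in-flight check can be scripted roughly as follows. This is only a sketch: it assumes the OSD admin sockets sit under /var/run/ceph/ (the usual default, adjust as needed) and that dump_ops_in_flight returns JSON with a num_ops field and an ops list, which may differ between releases. The function names are made up for illustration.

```python
import glob
import json
import subprocess

def count_in_flight(dump_json):
    """Number of in-flight ops in one dump_ops_in_flight JSON document."""
    data = json.loads(dump_json)
    if "num_ops" in data:                 # assumed field name
        return int(data["num_ops"])
    return len(data.get("ops", []))       # fall back to counting entries

def dump_ops_in_flight(asok_path):
    """Query one OSD through its admin socket."""
    return subprocess.check_output(
        ["ceph", "--admin-daemon", asok_path, "dump_ops_in_flight"]
    ).decode()

def in_flight_by_osd(pattern="/var/run/ceph/ceph-osd.*.asok"):
    """Map each local OSD admin socket to its in-flight op count."""
    counts = {}
    for path in sorted(glob.glob(pattern)):
        counts[path] = count_in_flight(dump_ops_in_flight(path))
    return counts
```

Calling in_flight_by_osd() on each OSD host every few seconds and logging the result makes it easier to line the counts up against the IO wait spikes seen in the VMs.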
Any ideas why this happens?
Thanks.
-Simon
On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <[email protected]> wrote:
On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
1) I did check iostat on all OSDs, and iowait seems normal.
2) ceph -w shows no correlation between high io wait and high iops.
Sometimes the reverse is true: when io wait is high (since it's a
cluster wide thing), the overall ceph iops drops too.
Not sure if you are doing it yet, but you may want to look at the
statistics the OSDs can provide via the admin socket, especially
outstanding operations and dump_historic_ops. If you look at these
for all of your OSDs, you can start getting a feel for whether any
specific OSDs are slow and, if so, what the slow ops are hanging on.
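As a rough sketch of how the historic ops could be ranked, assuming dump_historic_ops returns JSON with a list of ops under an "ops" or "Ops" key, each carrying "description" and "duration" fields (the exact layout may vary by release, so treat these key names as assumptions). The function name slowest_ops is hypothetical:

```python
import json

def slowest_ops(dump_json, top=5):
    """Return (duration, description) for the slowest ops, longest first."""
    data = json.loads(dump_json)
    # Key casing is an assumption; accept either spelling.
    ops = data.get("ops") or data.get("Ops") or []
    ranked = sorted(
        ops, key=lambda op: float(op.get("duration", 0)), reverse=True
    )
    return [
        (float(op.get("duration", 0)), op.get("description", "?"))
        for op in ranked[:top]
    ]
```

The input would be the output of `ceph --admin-daemon <socket path> dump_historic_ops` for each OSD. An OSD whose slowest ops are consistently much longer than its peers' is a candidate for a closer look.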
3) We have collectd running in VMs, and that's how we identified the
frequent high io wait. This happens for even lightly used VMs.
Thanks.
-Simon
On Thu, Jun 12, 2014 at 9:26 AM, David <[email protected]> wrote:
Hi Simon,
Did you check iostat on the OSDs to check their utilization? What
does your ceph -w say? Perhaps you're maxing out your cluster's IOPS?
Also, are you running any monitoring of your VMs' iostats? We've
often found some culprits overusing IO.
Kind Regards,
David Majchrzak
On 12 Jun 2014, at 15:22, Xu (Simon) Chen <[email protected]> wrote:
> Hi folks,
>
> We have two similar ceph deployments, but one of them is having
> trouble: VMs running with ceph-provided block devices are seeing
> frequent high io wait, every few minutes, usually 15-20%, but as
> high as 60-70%. This is cluster-wide and not correlated with the VMs' IO
> load. We turned on rbd cache and enabled writeback in qemu, but the
> problem persists. No-deepscrub doesn't help either.
>
> Without providing any of our probably wrong theories, any
> ideas on how to troubleshoot?
>
> Thanks.
> -Simon
> _______________________________________________
> ceph-users mailing list
> [email protected]
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com