We actually disabled swap altogether on these machines...
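
For anyone who wants to confirm the same state on their own OSD hosts, here is a minimal sketch (not something posted in this thread) that reads the SwapTotal and SwapFree fields from /proc/meminfo. The function name swap_usage_kib is made up for illustration.

```python
def swap_usage_kib(meminfo_text):
    """Return (swap_total, swap_used) in KiB from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])
    total = fields.get("SwapTotal", 0)
    free = fields.get("SwapFree", 0)
    return total, total - free


def read_swap_usage(path="/proc/meminfo"):
    """Convenience wrapper that reads the live file on a Linux host."""
    with open(path) as f:
        return swap_usage_kib(f.read())
```

On a host with swap fully disabled, read_swap_usage() should return a total of 0; a non-zero "used" value on an OSD server is the situation Mark asked about below.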

On Thu, Jun 12, 2014 at 5:06 PM, Gregory Farnum <[email protected]> wrote:

> To be clear, that's the solution to one of the causes of this issue.
> The log message is very general, and just means that a disk access
> thread has been gone for a long time (15 seconds, in this case)
> without checking in (so usually, it's been inside of a read/write
> syscall for >=15 seconds).
> Other causes include simple overload of the OSDs in question, or a
> broken local filesystem, or...
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
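
Since the message is this general, it can help to see how often it fires and for which thread pool. Below is a rough sketch that counts such lines per thread pool and per minute. The regular expression is an assumption based only on the single log line quoted further down in this thread; other Ceph releases may format the line differently. The name count_timeouts is hypothetical.

```python
import re
from collections import Counter

# Pattern modelled on the one sample line in this thread (an assumption):
# <timestamp> ... heartbeat_map reset_timeout '<pool> thread 0x...' had timed out after <N>
TIMEOUT_RE = re.compile(
    r"(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d+) .*"
    r"heartbeat_map \w+ '(?P<pool>\S+) thread (?P<thread>0x[0-9a-f]+)' "
    r"had timed out after (?P<secs>\d+)"
)


def count_timeouts(lines):
    """Count heartbeat timeouts keyed by (thread pool, 'YYYY-MM-DD HH:MM')."""
    counts = Counter()
    for line in lines:
        m = TIMEOUT_RE.search(line)
        if m:
            minute = m.group("ts")[:16]  # truncate timestamp to the minute
            counts[(m.group("pool"), minute)] += 1
    return counts
```

Feeding it the lines of an OSD log (for example, an open file object) gives a quick picture of whether the timeouts cluster around the same minutes in which the VMs report high io wait.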
>
> On Thu, Jun 12, 2014 at 1:59 PM, Mark Nelson <[email protected]>
> wrote:
> > Can you check and see if swap is being used on your OSD servers when this
> > happens, and even better, use something like collectl or another tool to
> > look for major page faults?
> >
> > If you see anything like this, you may want to tweak swappiness to be
> > lower (say 10).
> >
> > Mark
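
If collectl is not at hand, one possible substitute is to sample the kernel counters in /proc/vmstat directly. The sketch below assumes the pgmajfault, pswpin and pswpout counters are present there (they are on typical Linux kernels) and turns two snapshots into per-second rates. The function names are made up for illustration.

```python
def parse_vmstat(text):
    """Parse the 'name value' lines of /proc/vmstat into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def rate_per_second(before, after, interval_s,
                    keys=("pgmajfault", "pswpin", "pswpout")):
    """Per-second rate of each counter between two parsed snapshots."""
    return {
        k: (after.get(k, 0) - before.get(k, 0)) / float(interval_s)
        for k in keys
    }
```

In practice one would read /proc/vmstat, sleep for a known interval, read it again, and pass both parsed snapshots in. A sustained non-zero pgmajfault or pswpin rate during the slow periods would point toward the swap explanation. The current swappiness value can be read from /proc/sys/vm/swappiness.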
> >
> >
> > On 06/12/2014 03:17 PM, Xu (Simon) Chen wrote:
> >>
> >> I've done some more tracing. It looks like the high IO wait in VMs is
> >> somewhat correlated with some OSDs having high in-flight ops (ceph admin
> >> socket, dump_ops_in_flight).
> >>
> >> When in_flight_ops is high, I see something like this in the OSD log:
> >> 2014-06-12 19:57:24.572338 7f4db6bdf700  1 heartbeat_map reset_timeout
> >> 'OSD::op_tp thread 0x7f4db6bdf700' had timed out after 15
> >>
> >> Any ideas why this happens?
> >>
> >> Thanks.
> >> -Simon
> >>
> >>
> >>
> >> On Thu, Jun 12, 2014 at 11:14 AM, Mark Nelson <[email protected]> wrote:
> >>
> >>     On 06/12/2014 08:47 AM, Xu (Simon) Chen wrote:
> >>
> >>         1) I did check iostat on all OSDs, and iowait seems normal.
> >>         2) ceph -w shows no correlation between high io wait and high
> >>         iops. Sometimes the reverse is true: when io wait is high (since
> >>         it's a cluster-wide thing), the overall ceph iops drops too.
> >>
> >>
> >>     Not sure if you are doing it yet, but you may want to look at the
> >>     statistics the OSDs can provide via the admin socket, especially
> >>     outstanding operations and dump_historic_ops.  If you look at these
> >>     for all of your OSDs you can start getting a feel for whether any
> >>     specific OSDs are slow and, if so, what the slow ops are hanging on.
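
One way to compare OSDs is sketched below, assuming the JSON from dump_ops_in_flight has already been collected from each OSD's admin socket. The field names "num_ops", "ops" and "age" are assumptions about that output and may differ between Ceph releases; rank_osds_by_inflight is a hypothetical helper, not part of Ceph.

```python
import json


def rank_osds_by_inflight(dumps):
    """Rank OSDs by in-flight op count, then by age of their oldest op.

    dumps maps an OSD name (e.g. "osd.3") to the JSON text captured from
    that OSD's dump_ops_in_flight command. Field names are assumed.
    """
    ranked = []
    for osd, text in dumps.items():
        data = json.loads(text)
        ops = data.get("ops", [])
        num_ops = data.get("num_ops", len(ops))
        oldest = max((float(op.get("age", 0)) for op in ops), default=0.0)
        ranked.append((osd, num_ops, oldest))
    # Busiest OSDs first; ties broken by the age of the oldest op.
    ranked.sort(key=lambda r: (r[1], r[2]), reverse=True)
    return ranked
```

If the same few OSDs keep showing up at the top of the ranking while the VMs see high io wait, those are the ones to inspect more closely with dump_historic_ops.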
> >>
> >>         3) We have collectd running in VMs, and that's how we identified
> >>         the frequent high io wait. This happens even for lightly used VMs.
> >>
> >>         Thanks.
> >>         -Simon
> >>
> >>
> >>         On Thu, Jun 12, 2014 at 9:26 AM, David <[email protected]> wrote:
> >>
> >>              Hi Simon,
> >>
> >>              Did you check iostat on the OSDs to check their utilization?
> >>              What does your ceph -w say - perhaps you’re maxing out your
> >>              cluster’s IOPS? Also, are you running any monitoring of your
> >>              VMs’ iostats? We’ve often found some culprits overusing IOs.
> >>
> >>              Kind Regards,
> >>              David Majchrzak
> >>
> >>              On 12 Jun 2014, at 15:22, Xu (Simon) Chen
> >>              <[email protected]> wrote:
> >>
> >>
> >>
> >>               > Hi folks,
> >>               >
> >>               > We have two similar ceph deployments, but one of them is
> >>               > having trouble: VMs running with ceph-provided block devices
> >>               > are seeing frequent high io wait, every few minutes, usually
> >>               > 15-20%, but as high as 60-70%. This is cluster-wide and not
> >>               > correlated with the VMs' IO load. We turned on rbd cache and
> >>               > enabled writeback in qemu, but the problem persists. Setting
> >>               > nodeep-scrub doesn't help either.
> >>               >
> >>               > Without providing any one of our probably wrong theories, any
> >>               > ideas on how to troubleshoot?
> >>               >
> >>               > Thanks.
> >>               > -Simon
> >>               > _______________________________________________
> >>               > ceph-users mailing list
> >>               > [email protected]
> >>               > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
>
