> On 11 April 2014 at 8:50, Josef Johansson <[email protected]> wrote:
>
>
> Hi,
>
> On 11/04/14 07:29, Wido den Hollander wrote:
> >
> >> On 11 April 2014 at 7:13, Greg Poirier <[email protected]> wrote:
> >>
> >>
> >> One thing to note....
> >> All of our KVM VMs have to be rebooted. This is something I wasn't
> >> expecting. Tried waiting for them to recover on their own, but that's
> >> not happening. Rebooting them restores service immediately. :/ Not ideal.
> >>
> > A reboot isn't really required, though. It could be that the VM itself is
> > in trouble, but from a librados/librbd perspective, I/O should simply
> > continue as soon as an osdmap has been received without the "full" flag.
> >
> > It could be that you have to wait some time before the VM continues. This
> > can take up to 15 minutes.
> With other storage solutions you would have to change the timeout value
> for each disk, i.e. raising it from 60 to 180 seconds, for the VMs to
> survive storage problems.
> Does Ceph handle this differently somehow?
>
It's not that RBD does it differently. Librados simply blocks the I/O, and so does librbd, which in turn causes Qemu to block. I've seen VMs survive RBD issues for periods longer than 60 seconds; given some time, they continued again.

Which exact setting are you talking about? I'm talking about a Qemu/KVM VM running with a VirtIO drive.

Wido

> Cheers,
> Josef
>
> > Wido
> >
> >> On Thu, Apr 10, 2014 at 10:12 PM, Greg Poirier
> >> <[email protected]> wrote:
> >>
> >>> Going to try increasing the full ratio. Disk utilization wasn't really
> >>> growing at an unreasonable pace. I'm going to keep an eye on it for the
> >>> next couple of hours and down/out the OSDs if necessary.
> >>>
> >>> We have four more machines that we're in the process of adding (which
> >>> doubles the number of OSDs), but got held up by some networking nonsense.
> >>>
> >>> Thanks for the tips.
> >>>
> >>>
> >>> On Thu, Apr 10, 2014 at 9:51 PM, Sage Weil <[email protected]> wrote:
> >>>
> >>>> On Thu, 10 Apr 2014, Greg Poirier wrote:
> >>>>> Hi,
> >>>>> I have about 200 VMs with a common RBD volume as their root
> >>>>> filesystem and a number of additional filesystems on Ceph.
> >>>>>
> >>>>> All of them have stopped responding. One of the OSDs in my cluster
> >>>>> is marked full. I tried stopping that OSD to force things to
> >>>>> rebalance or at least go to degraded mode, but nothing is responding
> >>>>> still.
> >>>>>
> >>>>> I'm not exactly sure what to do or how to investigate. Suggestions?
> >>>> Try marking the osd out or partially out (ceph osd reweight N .9) to
> >>>> move some data off, and/or adjust the full ratio up (ceph pg
> >>>> set_full_ratio .95). Note that this becomes increasingly dangerous as
> >>>> OSDs get closer to full; add some disks.
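[For context, the guest-side knob Josef is most likely referring to can be sketched as below. This is an editor's assumption, not something named in the thread: on Linux guests whose disks appear as SCSI devices there is a per-disk command timeout in sysfs, while virtio-blk devices expose no such timeout, which is consistent with Wido's observation that blocked I/O simply stalls and later resumes. The device name "sda" is a placeholder.]

```shell
# Sketch: raise the per-disk SCSI I/O timeout inside a Linux guest so the
# disk survives a longer storage stall. "sda" is a hypothetical device name.
DEV=sda
TIMEOUT_FILE="/sys/block/${DEV}/device/timeout"
if [ -w "$TIMEOUT_FILE" ]; then
    echo 180 > "$TIMEOUT_FILE"   # raise the SCSI command timeout to 180s
else
    # virtio-blk devices have no such file; their I/O just blocks until
    # the backend (here, librbd) completes the request.
    echo "no writable timeout for $DEV" >&2
fi
```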
> >>>>
> >>>> sage
> >>>
> >>>
> >> _______________________________________________
> >> ceph-users mailing list
> >> [email protected]
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
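[Editorial aside: Sage's advice above amounts to roughly the following sequence of standard ceph CLI calls. This is a hedged sketch, not a transcript from the thread; the OSD id 12 is a made-up example, and the snippet is guarded so it is a no-op on hosts without the ceph CLI.]

```shell
# Sketch of the recovery path for a full OSD, per the advice quoted above.
# OSD id 12 is hypothetical; substitute the id reported as full.
if command -v ceph >/dev/null 2>&1; then
    ceph health detail            # shows which OSD(s) are flagged full
    ceph pg set_full_ratio 0.95   # temporarily raise the full threshold
    ceph osd reweight 12 0.9      # push some data off the full OSD
    ceph df                       # watch utilization as data rebalances
fi
```

Raising the full ratio only buys time; as Sage notes, the real fix is adding capacity.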
