If the Raring guest was fine, I suspect the issue is not on the OSDs.
-Sam

On Wed, Aug 21, 2013 at 10:55 AM, Mike Dawson <mike.daw...@cloudapt.com> wrote:
> Sam,
>
> Tried it. I injected the flag with 'ceph tell osd.* injectargs --
> --no_osd_recover_clone_overlap', then stopped one OSD for ~1 minute. Upon
> restart, all my Windows VMs had issues until the cluster returned to
> HEALTH_OK.
>
> Recovery was taking an abnormally long time, so I reverted
> --no_osd_recover_clone_overlap after about 10 minutes to get back to HEALTH_OK.
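>
> A minimal sketch of that sequence, for reference (the osd id is made up,
> the stop/start lines assume Upstart-managed OSDs on Ubuntu, and the revert
> line is my best guess at the syntax, so adjust for your setup):
>
>     # disable clone-overlap recovery on all OSDs at runtime
>     ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap
>     # stop one OSD for about a minute, then bring it back
>     stop ceph-osd id=12 ; sleep 60 ; start ceph-osd id=12
>     # revert to the default once things settle
>     ceph tell osd.* injectargs -- --osd_recover_clone_overlap=true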
>
> Interestingly, a Raring guest running a different video surveillance package
> proceeded without any issue whatsoever.
>
> Here is an image of the traffic to some of these Windows guests:
>
> http://www.gammacode.com/upload/rbd-hang-with-clone-overlap.jpg
>
> Ceph was outside of HEALTH_OK between roughly 12:55 and 13:10. Most of these
> instances rebooted shortly after 13:10 due to an application error caused by
> the I/O hang.
>
> These Windows instances are booted as COW clones from a Glance image using
> Cinder. They also have a second RBD volume for bulk storage. I'm using qemu
> 1.5.2.
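>
> For what it's worth, a rough sketch of how the bulk-storage volume is
> attached through qemu's rbd driver (the pool and image names are
> hypothetical, and the real libvirt-generated command line carries more
> options such as auth and monitor settings):
>
>     -drive file=rbd:volumes/volume-1234:id=cinder,format=raw,if=virtio,cache=writeback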
>
> Thanks,
> Mike
>
>
>
> On 8/21/2013 1:12 PM, Samuel Just wrote:
>>
>> Ah, thanks for the correction.
>> -Sam
>>
>> On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.ro...@youscribe.com>
>> wrote:
>>>
>>> It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)
>>>
>>> -----Original Message-----
>>> From: ceph-devel-ow...@vger.kernel.org
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>>> Sent: Wednesday, August 21, 2013 17:33
>>> To: Mike Dawson
>>> Cc: Stefan Priebe - Profihost AG; josh.dur...@inktank.com;
>>> ceph-devel@vger.kernel.org
>>> Subject: Re: still recovery issues with cuttlefish
>>>
>>> Have you tried setting osd_recovery_clone_overlap to false?  That seemed
>>> to help with Stefan's issue.
>>> -Sam
>>>
>>> On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.daw...@cloudapt.com>
>>> wrote:
>>>>
>>>> Sam/Josh,
>>>>
>>>> We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
>>>> morning, hoping it would improve this situation, but there was no
>>>> appreciable change.
>>>>
>>>> One node in our cluster fsck'ed after a reboot and got a bit behind.
>>>> Our instances backed by RBD volumes were OK at that point, but once
>>>> the node booted fully and the OSDs started, all Windows instances with
>>>> rbd volumes experienced very choppy performance and were unable to
>>>> ingest video surveillance traffic and commit it to disk. Once the
>>>> cluster got back to HEALTH_OK, they resumed normal operation.
>>>>
>>>> I tried for a time with conservative recovery settings (osd max
>>>> backfills = 1, osd recovery op priority = 1, and osd recovery max
>>>> active = 1). No improvement for the guests. So I went to more
>>>> aggressive settings to get things moving faster. That decreased the
>>>> duration of the outage.
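>>>>
>>>> A sketch of how those values can be applied, either at runtime (the
>>>> quoted-string injectargs form is an assumption on my part) or
>>>> persistently in ceph.conf:
>>>>
>>>>     ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-op-priority 1 --osd-recovery-max-active 1'
>>>>
>>>>     [osd]
>>>>     osd max backfills = 1
>>>>     osd recovery op priority = 1
>>>>     osd recovery max active = 1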
>>>>
>>>> During the entire period of recovery/backfill, the network looked
>>>> fine, nowhere close to saturation. iowait on all drives looked fine as
>>>> well.
>>>>
>>>> Any ideas?
>>>>
>>>> Thanks,
>>>> Mike Dawson
>>>>
>>>>
>>>>
>>>> On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:
>>>>>
>>>>>
>>>>> The same problem still occurs. I will need to check when I have time to
>>>>> gather logs again.
>>>>>
>>>>> On 14.08.2013 01:11, Samuel Just wrote:
>>>>>>
>>>>>>
>>>>>> I'm not sure, but your logs did show that you had >16 recovery ops
>>>>>> in flight, so it's worth a try.  If it doesn't help, collect the same
>>>>>> set of logs and I'll look again.  Also, there are a few other patches
>>>>>> between 0.61.7 and current cuttlefish which may help.
>>>>>> -Sam
>>>>>>
>>>>>> On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
>>>>>> <s.pri...@profihost.ag> wrote:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On 13.08.2013 at 22:43, Samuel Just <sam.j...@inktank.com> wrote:
>>>>>>>
>>>>>>>> I just backported a couple of patches from next to fix a bug where
>>>>>>>> we weren't respecting the osd_recovery_max_active config in some
>>>>>>>> cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e).  You can either
>>>>>>>> try the current cuttlefish branch or wait for the 0.61.8 release.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks! Are you sure this is the issue? I don't believe it is, but
>>>>>>> I'll give it a try. I already tested a branch from Sage some weeks
>>>>>>> ago where he fixed a race regarding max active, so active recovery
>>>>>>> was capped at 1, but the issue didn't go away.
>>>>>>>
>>>>>>> Stefan
>>>>>>>
>>>>>>>> -Sam
>>>>>>>>
>>>>>>>> On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
>>>>>>>> <sam.j...@inktank.com>
>>>>>>>> wrote:
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I got swamped today.  I should be able to look tomorrow.  Sorry!
>>>>>>>>> -Sam
>>>>>>>>>
>>>>>>>>> On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
>>>>>>>>> <s.pri...@profihost.ag> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Did you take a look?
>>>>>>>>>>
>>>>>>>>>> Stefan
>>>>>>>>>>
>>>>>>>>>> On 11.08.2013 at 05:50, Samuel Just <sam.j...@inktank.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Great!  I'll take a look on Monday.
>>>>>>>>>>> -Sam
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
>>>>>>>>>>> <s.pri...@profihost.ag> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Hi Samuel,
>>>>>>>>>>>>
>>>>>>>>>>>> On 09.08.2013 23:44, Samuel Just wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>> I think Stefan's problem is probably distinct from Mike's.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stefan: Can you reproduce the problem with
>>>>>>>>>>>>>
>>>>>>>>>>>>> debug osd = 20
>>>>>>>>>>>>> debug filestore = 20
>>>>>>>>>>>>> debug ms = 1
>>>>>>>>>>>>> debug optracker = 20
>>>>>>>>>>>>>
>>>>>>>>>>>>> on a few osds (including the restarted osd), and upload those
>>>>>>>>>>>>> osd logs along with the ceph.log from before killing the osd
>>>>>>>>>>>>> until after the cluster becomes clean again?
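>>>>>>>>>>>>>
>>>>>>>>>>>>> A sketch of raising those levels at runtime on one osd (the id
>>>>>>>>>>>>> is illustrative and the quoted-string injectargs form is an
>>>>>>>>>>>>> assumption; alternatively, put the four debug lines above under
>>>>>>>>>>>>> [osd] in ceph.conf and restart that daemon):
>>>>>>>>>>>>>
>>>>>>>>>>>>>     ceph tell osd.52 injectargs '--debug-osd 20 --debug-filestore 20 --debug-ms 1 --debug-optracker 20'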
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> Done; you'll find the logs in the cephdrop folder:
>>>>>>>>>>>> slow_requests_recovering_cuttlefish
>>>>>>>>>>>>
>>>>>>>>>>>> osd.52 was the one recovering
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
>>>>>>>>>>>>
>>>>>>>>>>>> Greets,
>>>>>>>>>>>> Stefan
>>>>>>>>>>>
>>>>>>>>
>>>>>>
>>>>
>>>
>
