Sam,

Tried it. I injected the setting with 'ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap', then stopped one OSD for ~1 minute. Upon restarting it, all of my Windows VMs had issues until the cluster returned to HEALTH_OK.

Recovery was taking abnormally long, so after about 10 minutes I reverted the --no_osd_recover_clone_overlap change to get back to HEALTH_OK.
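
For reference, the runtime toggle looked roughly like this (the revert line is my best guess at the usual injectargs boolean form rather than a paste from my shell history):

# turn clone-overlap recovery off on all OSDs at runtime
ceph tell osd.* injectargs -- --no_osd_recover_clone_overlap

# put it back to the default afterwards
ceph tell osd.* injectargs -- --osd_recover_clone_overlap=true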

Interestingly, a Raring (Ubuntu 13.04) guest running a different video surveillance package proceeded without any issue whatsoever.

Here is an image of the traffic to some of these Windows guests:

http://www.gammacode.com/upload/rbd-hang-with-clone-overlap.jpg

Ceph was outside of HEALTH_OK between roughly 12:55 and 13:10. Most of these instances rebooted shortly after 13:10 due to an application error caused by the I/O hang.

These Windows instances are booted as copy-on-write clones of a Glance image via Cinder. They also have a second RBD volume for bulk storage. I'm using qemu 1.5.2.
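
In case the attachment path matters: the volumes reach the guests through qemu's rbd driver, so the resulting -drive option looks roughly like this (pool, image name, client id and cache mode here are placeholders for illustration, not my actual Cinder volume IDs or settings):

-drive file=rbd:volumes/volume-0001:id=cinder:conf=/etc/ceph/ceph.conf,format=raw,if=virtio,cache=writeback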

Thanks,
Mike


On 8/21/2013 1:12 PM, Samuel Just wrote:
Ah, thanks for the correction.
-Sam

On Wed, Aug 21, 2013 at 9:25 AM, Yann ROBIN <yann.ro...@youscribe.com> wrote:
It's osd recover clone overlap (see http://tracker.ceph.com/issues/5401)

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Wednesday, August 21, 2013 17:33
To: Mike Dawson
Cc: Stefan Priebe - Profihost AG; josh.dur...@inktank.com; 
ceph-devel@vger.kernel.org
Subject: Re: still recovery issues with cuttlefish

Have you tried setting osd_recovery_clone_overlap to false?  That seemed to 
help with Stefan's issue.
-Sam

On Wed, Aug 21, 2013 at 8:28 AM, Mike Dawson <mike.daw...@cloudapt.com> wrote:
Sam/Josh,

We upgraded from 0.61.7 to 0.67.1 during a maintenance window this
morning, hoping it would improve this situation, but there was no appreciable 
change.

One node in our cluster fsck'ed after a reboot and got a bit behind.
Our instances backed by RBD volumes were OK at that point, but once
the node booted fully and the OSDs started, all Windows instances with
RBD volumes experienced very choppy performance and were unable to
ingest video surveillance traffic and commit it to disk. Once the
cluster got back to HEALTH_OK, they resumed normal operation.

I tried for a time with conservative recovery settings (osd max
backfills = 1, osd recovery op priority = 1, and osd recovery max
active = 1). There was no improvement for the guests, so I went to more
aggressive settings to get things moving faster; that at least shortened
the duration of the outage.
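
For anyone wanting to reproduce the conservative profile, those values map onto the following ceph.conf keys (the [osd] section layout here is just the standard form, not copied from my actual config; the same keys can also be injected at runtime):

[osd]
    osd max backfills = 1
    osd recovery op priority = 1
    osd recovery max active = 1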

During the entire period of recovery/backfill, the network looked
fine, nowhere close to saturation. iowait on all drives looked fine as well.

Any ideas?

Thanks,
Mike Dawson



On 8/14/2013 3:04 AM, Stefan Priebe - Profihost AG wrote:

The same problem still occurs. I will need to check when I've got time to
gather the logs again.

On 14.08.2013 01:11, Samuel Just wrote:

I'm not sure, but your logs did show that you had >16 recovery ops
in flight, so it's worth a try.  If it doesn't help, you should
collect the same set of logs and I'll look again.  Also, there are a few
other patches between 0.61.7 and current cuttlefish which may help.
-Sam

On Tue, Aug 13, 2013 at 2:03 PM, Stefan Priebe - Profihost AG
<s.pri...@profihost.ag> wrote:


On 13.08.2013 at 22:43, Samuel Just <sam.j...@inktank.com> wrote:

I just backported a couple of patches from the next branch to fix a bug
where we weren't respecting the osd_recovery_max_active config in some
cases (1ea6b56170fc9e223e7c30635db02fa2ad8f4b4e).  You can either
try the current cuttlefish branch or wait for a 0.61.8 release.


Thanks! Are you sure that this is the issue? I don't believe it is, but
I'll give it a try. I already tested a branch from Sage a few weeks ago
where he fixed a race regarding max active, so active recovery was
capped at 1, but the issue didn't go away.

Stefan

-Sam

On Mon, Aug 12, 2013 at 10:34 PM, Samuel Just
<sam.j...@inktank.com>
wrote:

I got swamped today.  I should be able to look tomorrow.  Sorry!
-Sam

On Mon, Aug 12, 2013 at 9:39 PM, Stefan Priebe - Profihost AG
<s.pri...@profihost.ag> wrote:

Did you take a look?

Stefan

On 11.08.2013 at 05:50, Samuel Just <sam.j...@inktank.com> wrote:

Great!  I'll take a look on Monday.
-Sam

On Sat, Aug 10, 2013 at 12:08 PM, Stefan Priebe
<s.pri...@profihost.ag> wrote:

Hi Samuel,

On 09.08.2013 23:44, Samuel Just wrote:

I think Stefan's problem is probably distinct from Mike's.

Stefan: Can you reproduce the problem with

debug osd = 20
debug filestore = 20
debug ms = 1
debug optracker = 20

on a few osds (including the restarted osd), and upload those
osd logs along with the ceph.log from before killing the osd
until after the cluster becomes clean again?
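
(If editing ceph.conf and restarting is inconvenient, the same levels can usually be injected at runtime - something along these lines, treating the exact spelling as a sketch and substituting the osd ids you care about:

ceph tell osd.N injectargs -- --debug_osd 20 --debug_filestore 20 --debug_ms 1 --debug_optracker 20

where osd.N stands for the restarted osd plus a couple of its peers.)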



Done - you'll find the logs in the cephdrop folder:
slow_requests_recovering_cuttlefish

osd.52 was the one recovering.

Thanks!

Greets,
Stefan
