Re: [ceph-users] Cascading Failure of OSDs

Quentin Hartman Sat, 07 Mar 2015 10:36:45 -0800

So I'm not sure what has changed, but in the last 30 minutes the errors
which were all over the place, have finally settled down to this:
http://pastebin.com/VuCKwLDp


The only thing I can think of is that I also net the noscrub flag in
addition to the nodeep-scrub when I first got here, and that finally
"took". Anyway, they've been stable there for some time now, and I've been
able to get a couple VMs to come up and behave reasonably well. At this
point I'm prepared to wipe the entire cluster and start over if I have to
to get it truly consistent again, since my efforts to zap pg 3.75b haven't
borne fruit. However, if anyone has a less nuclear option they'd like to
suggest, I'm all ears.

I've tried to export/re-import the pg and do a force_create. The import
failed, and the force_create just reverted back to being incomplete after
"creating" for a few minutes.

QH

On Sat, Mar 7, 2015 at 9:29 AM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Now that I have a better understanding of what's happening, I threw
> together a little one-liner to create a report of the errors that the OSDs
> are seeing. Lots of missing  / corrupted pg shards:
> https://gist.github.com/qhartman/174cc567525060cb462e
>
> I've experimented with exporting / importing the broken pgs with
> ceph_objectstore_tool, and while they seem to export correctly, the tool
> crashes when trying to import:
>
> root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
> --data-path /var/lib/ceph/osd/ceph-19/ --journal-path
> /var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
> Importing pgid 3.75b
> Write 2672075b/rbd_data.2bce2ae8944a.0000000000001509/head//3
> Write 3473075b/rbd_data.1d6172ae8944a.000000000001636a/head//3
> Write f2e4075b/rbd_data.c816f2ae8944a.0000000000000208/head//3
> Write f215075b/rbd_data.c4a892ae8944a.0000000000000b6b/head//3
> Write c086075b/rbd_data.42a742ae8944a.00000000000002fb/head//3
> Write 6f9d075b/rbd_data.1d6172ae8944a.0000000000005ac3/head//3
> Write dd9f075b/rbd_data.1d6172ae8944a.000000000001127d/head//3
> Write f9f075b/rbd_data.c4a892ae8944a.000000000000f056/head//3
> Write 4d71175b/rbd_data.c4a892ae8944a.0000000000009e51/head//3
> Write bcc3175b/rbd_data.2bce2ae8944a.000000000000133f/head//3
> Write 1356175b/rbd_data.3f862ae8944a.00000000000005d6/head//3
> Write d327175b/rbd_data.1d6172ae8944a.000000000001af85/head//3
> Write 7388175b/rbd_data.2bce2ae8944a.0000000000001353/head//3
> Write 8cda175b/rbd_data.c4a892ae8944a.000000000000b585/head//3
> Write 6b3c175b/rbd_data.c4a892ae8944a.0000000000018e91/head//3
> Write d37f175b/rbd_data.1d6172ae8944a.0000000000003a90/head//3
> Write 4590275b/rbd_data.2bce2ae8944a.0000000000001f67/head//3
> Write fe51275b/rbd_data.c4a892ae8944a.000000000000e917/head//3
> Write 3402275b/rbd_data.3f5c2ae8944a.0000000000001252/6//3
> osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
> const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>,
> ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
> osd/SnapMapper.cc: 228: FAILED assert(r == -2)
>  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x8b) [0xb94fbb]
>  2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
> std::less<snapid_t>, std::allocator<snapid_t> > const&,
> MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
>  3: (get_attrs(ObjectStore*, coll_t, ghobject_t,
> ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
> SnapMapper&)+0x67c) [0x661a1c]
>  4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
> [0x661f85]
>  5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
>  6: (main()+0x2208) [0x63f178]
>  7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
>  8: ceph_objectstore_tool() [0x659577]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
> to interpret this.
> terminate called after throwing an instance of 'ceph::FailedAssertion'
> *** Caught signal (Aborted) **
>  in thread 7fba67ff3900
>  ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
>  1: ceph_objectstore_tool() [0xab1cea]
>  2: (()+0x10340) [0x7fba66a95340]
>  3: (gsignal()+0x39) [0x7fba627c7cc9]
>  4: (abort()+0x148) [0x7fba627cb0d8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
>  6: (()+0x5e836) [0x7fba630d0836]
>  7: (()+0x5e863) [0x7fba630d0863]
>  8: (()+0x5eaa2) [0x7fba630d0aa2]
>  9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x278) [0xb951a8]
>  10: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
> std::less<snapid_t>, std::allocator<snapid_t> > const&,
> MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
>  11: (get_attrs(ObjectStore*, coll_t, ghobject_t,
> ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
> SnapMapper&)+0x67c) [0x661a1c]
>  12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
> [0x661f85]
>  13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
>  14: (main()+0x2208) [0x63f178]
>  15: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
>  16: ceph_objectstore_tool() [0x659577]
> Aborted (core dumped)
>
>
> Which I suppose is expected if it's importing from bad pg data. At this
> point I'm really most interested in what I can do to get this cluster
> consistent as quickly as possible so I can start coping with the data loss
> in the VMs and start restoring from backups where needed. Any guidance in
> that direction would be appreciated. Something along the lines of "give up
> on that busted pg" is what I'm thinking of, but I haven't noticed anything
> that seems to approximate that yet.
>
> Thanks
>
> QH
>
>
>
>
> On Fri, Mar 6, 2015 at 8:47 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Here's more information I have been able to glean:
>>
>> pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
>> acting [24]
>> pg 3.690 is stuck inactive for 11991.281739, current state incomplete,
>> last acting [24]
>> pg 4.ca is stuck inactive for 15905.499058, current state incomplete,
>> last acting [24]
>> pg 3.5d3 is stuck unclean for 917.471550, current state incomplete, last
>> acting [24]
>> pg 3.690 is stuck unclean for 11991.281843, current state incomplete,
>> last acting [24]
>> pg 4.ca is stuck unclean for 15905.499162, current state incomplete,
>> last acting [24]
>> pg 3.19c is incomplete, acting [24] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 4.ca is incomplete, acting [24] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 5.7a is incomplete, acting [24] (reducing pool backups min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 5.6b is incomplete, acting [24] (reducing pool backups min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 3.6bf is incomplete, acting [24] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.690 is incomplete, acting [24] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.5d3 is incomplete, acting [24] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>>
>>
>> However, that list of incomplete pgs keeps changing each time I run "ceph
>> health detail | grep incomplete". For example, here is the output
>> regenerated moments after I created the above:
>>
>> HEALTH_ERR 34 pgs incomplete; 2 pgs inconsistent; 37 pgs peering; 470 pgs
>> stale; 13 pgs stuck inactive; 13 pgs stuck unclean; 4 scrub errors; 1/24 in
>> osds are down; noout,nodeep-scrub flag(s) set
>> pg 3.da is stuck inactive for 7977.699449, current state incomplete, last
>> acting [19]
>> pg 3.1a4 is stuck inactive for 6364.787502, current state incomplete,
>> last acting [14]
>> pg 4.c4 is stuck inactive for 8759.642771, current state incomplete, last
>> acting [14]
>> pg 3.4fa is stuck inactive for 8173.078486, current state incomplete,
>> last acting [14]
>> pg 3.372 is stuck inactive for 6706.018758, current state incomplete,
>> last acting [14]
>> pg 3.4ca is stuck inactive for 7121.446109, current state incomplete,
>> last acting [14]
>> pg 0.6 is stuck inactive for 8759.591368, current state incomplete, last
>> acting [14]
>> pg 3.343 is stuck inactive for 7996.560271, current state incomplete,
>> last acting [14]
>> pg 3.453 is stuck inactive for 6420.686656, current state incomplete,
>> last acting [14]
>> pg 3.4c1 is stuck inactive for 7049.443221, current state incomplete,
>> last acting [14]
>> pg 3.80 is stuck inactive for 7587.105164, current state incomplete, last
>> acting [14]
>> pg 3.4a7 is stuck inactive for 5506.691333, current state incomplete,
>> last acting [14]
>> pg 3.5ce is stuck inactive for 7153.943506, current state incomplete,
>> last acting [14]
>> pg 3.da is stuck unclean for 11816.026865, current state incomplete, last
>> acting [19]
>> pg 3.1a4 is stuck unclean for 8759.633093, current state incomplete, last
>> acting [14]
>> pg 3.4fa is stuck unclean for 8759.658848, current state incomplete, last
>> acting [14]
>> pg 4.c4 is stuck unclean for 8759.642866, current state incomplete, last
>> acting [14]
>> pg 3.372 is stuck unclean for 8759.662338, current state incomplete, last
>> acting [14]
>> pg 3.4ca is stuck unclean for 8759.603350, current state incomplete, last
>> acting [14]
>> pg 0.6 is stuck unclean for 8759.591459, current state incomplete, last
>> acting [14]
>> pg 3.343 is stuck unclean for 8759.645236, current state incomplete, last
>> acting [14]
>> pg 3.453 is stuck unclean for 8759.643875, current state incomplete, last
>> acting [14]
>> pg 3.4c1 is stuck unclean for 8759.606092, current state incomplete, last
>> acting [14]
>> pg 3.80 is stuck unclean for 8759.644522, current state incomplete, last
>> acting [14]
>> pg 3.4a7 is stuck unclean for 12723.462164, current state incomplete,
>> last acting [14]
>> pg 3.5ce is stuck unclean for 10024.882545, current state incomplete,
>> last acting [14]
>> pg 3.1a4 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 4.1a1 is incomplete, acting [14] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.138 is incomplete, acting [14] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 3.da is incomplete, acting [19] (reducing pool volumes min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.c4 is incomplete, acting [14] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 3.80 is incomplete, acting [14] (reducing pool volumes min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.70 is incomplete, acting [19] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.57 is incomplete, acting [14] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 3.4c is incomplete, acting [14] (reducing pool volumes min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 5.18 is incomplete, acting [19] (reducing pool backups min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 4.13 is incomplete, acting [14] (reducing pool images min_size from 2
>> may help; search ceph.com/docs for 'incomplete')
>> pg 0.6 is incomplete, acting [14] (reducing pool data min_size from 2 may
>> help; search ceph.com/docs for 'incomplete')
>> pg 3.7dc is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.6b4 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.692 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.5fc is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.5ce is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.4fa is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.4ca is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.4c1 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.4a7 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.460 is incomplete, acting [19] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.453 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.394 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.372 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.343 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.337 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.321 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.2c0 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.27c is incomplete, acting [19] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.27e is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.244 is incomplete, acting [14] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>> pg 3.207 is incomplete, acting [19] (reducing pool volumes min_size from
>> 2 may help; search ceph.com/docs for 'incomplete')
>>
>>
>> Why would this keep changing? It seems like it would have to be because
>> of the OSDs running through their crash loops, only accurately reporting
>> from time to time, making it difficult to get an accurate view of the
>> extent of the damage.
>>
>>
>> On Fri, Mar 6, 2015 at 8:30 PM, Quentin Hartman <
>> qhart...@direwolfdigital.com> wrote:
>>
>>> Thanks for the response. Is this the post you are referring to?
>>> http://ceph.com/community/incomplete-pgs-oh-my/
>>>
>>> For what it's worth, this cluster was running happily for the better
>>> part of a year until the event from this weekend that I described in my
>>> first post, so I doubt it's configuration issue. I suppose it could be some
>>> edge-casey thing, that only came up just now, but that seems unlikely. Our
>>> usage of this cluster has been much heavier in the past than it has been
>>> recently.
>>>
>>> And yes, I have what looks to be about 8 pg shards on several OSDs that
>>> seem to be in this state, but it's hard to say for sure. It seems like each
>>> time I look at this more problems are popping up.
>>>
>>> On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>>
>>>> This might be related to the backtrace assert, but that's the problem
>>>> you need to focus on. In particular, both of these errors are caused
>>>> by the scrub code, which Sage suggested temporarily disabling — if
>>>> you're still getting these messages, you clearly haven't done so
>>>> successfully.
>>>>
>>>> That said, it looks like the problem is that the object and/or object
>>>> info specified here are just totally busted. You probably want to
>>>> figure out what happened there since these errors are normally a
>>>> misconfiguration somewhere (e.g., setting nobarrier on fs mount and
>>>> then losing power). I'm not sure if there's a good way to repair the
>>>> object, but if you can lose the data I'd grab the ceph-objectstore
>>>> tool and remove the object from each OSD holding it that way. (There's
>>>> a walkthrough of using it for a similar situation in a recent Ceph
>>>> blog post.)
>>>>
>>>> On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
>>>> <qhart...@direwolfdigital.com> wrote:
>>>> > Alright, tried a few suggestions for repairing this state, but I
>>>> don't seem
>>>> > to have any PG replicas that have good copies of the missing / zero
>>>> length
>>>> > shards. What do I do now? telling the pg's to repair doesn't seem to
>>>> help
>>>> > anything? I can deal with data loss if I can figure out which images
>>>> might
>>>> > be damaged, I just need to get the cluster consistent enough that the
>>>> things
>>>> > which aren't damaged can be usable.
>>>> >
>>>> > Also, I'm seeing these similar, but not quite identical, error
>>>> messages as
>>>> > well. I assume they are referring to the same root problem:
>>>> >
>>>> > -1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d
>>>> shard 22:
>>>> > soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 !=
>>>> known
>>>> > size 4194304
>>>>
>>>> Mmm, unfortunately that's a different object than the one referenced
>>>> in the earlier crash. Maybe it's repairable, or it might be the same
>>>> issue — looks like maybe you've got some widespread data loss.
>>>> -Greg
>>>>
>>>> >
>>>> >
>>>> >
>>>> > On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
>>>> > <qhart...@direwolfdigital.com> wrote:
>>>> >>
>>>> >> Finally found an error that seems to provide some direction:
>>>> >>
>>>> >> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub
>>>> 3.18e
>>>> >> e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size
>>>> (0) does
>>>> >> not match object info size (4120576) ajusted for ondisk to (4120576)
>>>> >>
>>>> >> I'm diving into google now and hoping for something useful. If
>>>> anyone has
>>>> >> a suggestion, I'm all ears!
>>>> >>
>>>> >> QH
>>>> >>
>>>> >> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
>>>> >> <qhart...@direwolfdigital.com> wrote:
>>>> >>>
>>>> >>> Thanks for the suggestion, but that doesn't seem to have made a
>>>> >>> difference.
>>>> >>>
>>>> >>> I've shut the entire cluster down and brought it back up, and my
>>>> config
>>>> >>> management system seems to have upgraded ceph to 0.80.8 during the
>>>> reboot.
>>>> >>> Everything seems to have come back up, but I am still seeing the
>>>> crash
>>>> >>> loops, so that seems to indicate that this is definitely something
>>>> >>> persistent, probably tied to the OSD data, rather than some weird
>>>> transient
>>>> >>> state.
>>>> >>>
>>>> >>>
>>>> >>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net>
>>>> wrote:
>>>> >>>>
>>>> >>>> It looks like you may be able to work around the issue for the
>>>> moment
>>>> >>>> with
>>>> >>>>
>>>> >>>>  ceph osd set nodeep-scrub
>>>> >>>>
>>>> >>>> as it looks like it is scrub that is getting stuck?
>>>> >>>>
>>>> >>>> sage
>>>> >>>>
>>>> >>>>
>>>> >>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>>> >>>>
>>>> >>>> > Ceph health detail - http://pastebin.com/5URX9SsQpg dump
>>>> summary (with
>>>> >>>> > active+clean pgs removed) - http://pastebin.com/Y5ATvWDZ
>>>> >>>> > an osd crash log (in github gist because it was too big for
>>>> pastebin)
>>>> >>>> > -
>>>> >>>> > https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>>> >>>> >
>>>> >>>> > And now I've got four OSDs that are looping.....
>>>> >>>> >
>>>> >>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>>> >>>> > <qhart...@direwolfdigital.com> wrote:
>>>> >>>> >       So I'm in the middle of trying to triage a problem with my
>>>> ceph
>>>> >>>> >       cluster running 0.80.5. I have 24 OSDs spread across 8
>>>> machines.
>>>> >>>> >       The cluster has been running happily for about a year.
>>>> This last
>>>> >>>> >       weekend, something caused the box running the MDS to sieze
>>>> hard,
>>>> >>>> >       and when we came in on monday, several OSDs were down or
>>>> >>>> >       unresponsive. I brought the MDS and the OSDs back on
>>>> online, and
>>>> >>>> >       managed to get things running again with minimal data
>>>> loss. Had
>>>> >>>> >       to mark a few objects as lost, but things were apparently
>>>> >>>> >       running fine at the end of the day on Monday.
>>>> >>>> > This afternoon, I noticed that one of the OSDs was apparently
>>>> stuck in
>>>> >>>> > a crash/restart loop, and the cluster was unhappy. Performance
>>>> was in
>>>> >>>> > the tank and "ceph status" is reporting all manner of problems,
>>>> as one
>>>> >>>> > would expect if an OSD is misbehaving. I marked the offending
>>>> OSD out,
>>>> >>>> > and the cluster started rebalancing as expected. However, I
>>>> noticed a
>>>> >>>> > short while later, another OSD has started into a crash/restart
>>>> loop.
>>>> >>>> > So, I repeat the process. And it happens again. At this point I
>>>> >>>> > notice, that there are actually two at a time which are in this
>>>> state.
>>>> >>>> >
>>>> >>>> > It's as if there's some toxic chunk of data that is getting
>>>> passed
>>>> >>>> > around, and when it lands on an OSD it kills it. Contrary to
>>>> that,
>>>> >>>> > however, I tried just stopping an OSD when it's in a bad state,
>>>> and
>>>> >>>> > once the cluster starts to try rebalancing with that OSD down
>>>> and not
>>>> >>>> > previously marked out, another OSD will start crash-looping.
>>>> >>>> >
>>>> >>>> > I've investigated the disk of the first OSD I found with this
>>>> problem,
>>>> >>>> > and it has no apparent corruption on the file system.
>>>> >>>> >
>>>> >>>> > I'll follow up to this shortly with links to pastes of log
>>>> snippets.
>>>> >>>> > Any input would be appreciated. This is turning into a real
>>>> cascade
>>>> >>>> > failure, and I haven't any idea how to stop it.
>>>> >>>> >
>>>> >>>> > QH
>>>> >>>> >
>>>> >>>> >
>>>> >>>> >
>>>> >>>> >
>>>> >>>
>>>> >>>
>>>> >>
>>>> >
>>>> >
>>>> > _______________________________________________
>>>> > ceph-users mailing list
>>>> > ceph-users@lists.ceph.com
>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >
>>>>
>>>
>>>
>>
>

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Cascading Failure of OSDs

Reply via email to