Now that I have a better understanding of what's happening, I threw
together a little one-liner to create a report of the errors that the OSDs
are seeing. Lots of missing / corrupted pg shards:
https://gist.github.com/qhartman/174cc567525060cb462e
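
(The actual one-liner is in the gist above; something along these lines
would produce a similar summary, assuming the default OSD log locations:

  for f in /var/log/ceph/ceph-osd.*.log; do
    echo "== $f =="
    grep -E 'ERR|missing|corrupt' "$f" | sort | uniq -c | sort -rn | head
  done

i.e. just tallying the error lines each OSD has been logging.)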

I've experimented with exporting / importing the broken pgs with
ceph_objectstore_tool, and while they seem to export correctly, the tool
crashes when trying to import:

root@node12:/var/lib/ceph/osd# ceph_objectstore_tool --op import
--data-path /var/lib/ceph/osd/ceph-19/ --journal-path
/var/lib/ceph/osd/ceph-19/journal --file ~/3.75b.export
Importing pgid 3.75b
Write 2672075b/rbd_data.2bce2ae8944a.0000000000001509/head//3
Write 3473075b/rbd_data.1d6172ae8944a.000000000001636a/head//3
Write f2e4075b/rbd_data.c816f2ae8944a.0000000000000208/head//3
Write f215075b/rbd_data.c4a892ae8944a.0000000000000b6b/head//3
Write c086075b/rbd_data.42a742ae8944a.00000000000002fb/head//3
Write 6f9d075b/rbd_data.1d6172ae8944a.0000000000005ac3/head//3
Write dd9f075b/rbd_data.1d6172ae8944a.000000000001127d/head//3
Write f9f075b/rbd_data.c4a892ae8944a.000000000000f056/head//3
Write 4d71175b/rbd_data.c4a892ae8944a.0000000000009e51/head//3
Write bcc3175b/rbd_data.2bce2ae8944a.000000000000133f/head//3
Write 1356175b/rbd_data.3f862ae8944a.00000000000005d6/head//3
Write d327175b/rbd_data.1d6172ae8944a.000000000001af85/head//3
Write 7388175b/rbd_data.2bce2ae8944a.0000000000001353/head//3
Write 8cda175b/rbd_data.c4a892ae8944a.000000000000b585/head//3
Write 6b3c175b/rbd_data.c4a892ae8944a.0000000000018e91/head//3
Write d37f175b/rbd_data.1d6172ae8944a.0000000000003a90/head//3
Write 4590275b/rbd_data.2bce2ae8944a.0000000000001f67/head//3
Write fe51275b/rbd_data.c4a892ae8944a.000000000000e917/head//3
Write 3402275b/rbd_data.3f5c2ae8944a.0000000000001252/6//3
osd/SnapMapper.cc: In function 'void SnapMapper::add_oid(const hobject_t&,
const std::set<snapid_t>&, MapCacher::Transaction<std::basic_string<char>,
ceph::buffer::list>*)' thread 7fba67ff3900 time 2015-03-07 16:21:57.921820
osd/SnapMapper.cc: 228: FAILED assert(r == -2)
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0xb94fbb]
 2: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> > const&,
MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
 3: (get_attrs(ObjectStore*, coll_t, ghobject_t, ObjectStore::Transaction*,
ceph::buffer::list&, OSDriver&, SnapMapper&)+0x67c) [0x661a1c]
 4: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5) [0x661f85]
 5: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 6: (main()+0x2208) [0x63f178]
 7: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 8: ceph_objectstore_tool() [0x659577]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion'
*** Caught signal (Aborted) **
 in thread 7fba67ff3900
 ceph version 0.87.1 (283c2e7cfa2457799f534744d7d549f83ea1335e)
 1: ceph_objectstore_tool() [0xab1cea]
 2: (()+0x10340) [0x7fba66a95340]
 3: (gsignal()+0x39) [0x7fba627c7cc9]
 4: (abort()+0x148) [0x7fba627cb0d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fba630d26b5]
 6: (()+0x5e836) [0x7fba630d0836]
 7: (()+0x5e863) [0x7fba630d0863]
 8: (()+0x5eaa2) [0x7fba630d0aa2]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0xb951a8]
 10: (SnapMapper::add_oid(hobject_t const&, std::set<snapid_t,
std::less<snapid_t>, std::allocator<snapid_t> > const&,
MapCacher::Transaction<std::string, ceph::buffer::list>*)+0x63e) [0x7b719e]
 11: (get_attrs(ObjectStore*, coll_t, ghobject_t,
ObjectStore::Transaction*, ceph::buffer::list&, OSDriver&,
SnapMapper&)+0x67c) [0x661a1c]
 12: (get_object(ObjectStore*, coll_t, ceph::buffer::list&)+0x3e5)
[0x661f85]
 13: (do_import(ObjectStore*, OSDSuperblock&)+0xd61) [0x665be1]
 14: (main()+0x2208) [0x63f178]
 15: (__libc_start_main()+0xf5) [0x7fba627b2ec5]
 16: ceph_objectstore_tool() [0x659577]
Aborted (core dumped)


Which I suppose is expected if it's importing from bad pg data. At this
point I'm mostly interested in getting this cluster consistent as quickly
as possible, so I can start coping with the data loss in the VMs and
restoring from backups where needed. Any guidance in that direction would
be appreciated. Something along the lines of "give up on that busted pg"
is what I'm after, but I haven't found anything that seems to approximate
that yet.
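
The closest candidates I've run across so far are things like these, though
I haven't confirmed that any of them is actually appropriate for
"incomplete" PGs, and the pgid / OSD id are just examples taken from my
output above:

  ceph pg 3.75b mark_unfound_lost revert
  ceph osd lost 19 --yes-i-really-mean-it
  ceph pg force_create_pg 3.75b

All of these discard data one way or another, so I'd want a sanity check
from the list before running any of them.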

Thanks

QH




On Fri, Mar 6, 2015 at 8:47 PM, Quentin Hartman <
qhart...@direwolfdigital.com> wrote:

> Here's more information I have been able to glean:
>
> pg 3.5d3 is stuck inactive for 917.471444, current state incomplete, last
> acting [24]
> pg 3.690 is stuck inactive for 11991.281739, current state incomplete,
> last acting [24]
> pg 4.ca is stuck inactive for 15905.499058, current state incomplete,
> last acting [24]
> pg 3.5d3 is stuck unclean for 917.471550, current state incomplete, last
> acting [24]
> pg 3.690 is stuck unclean for 11991.281843, current state incomplete, last
> acting [24]
> pg 4.ca is stuck unclean for 15905.499162, current state incomplete, last
> acting [24]
> pg 3.19c is incomplete, acting [24] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.ca is incomplete, acting [24] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 5.7a is incomplete, acting [24] (reducing pool backups min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 5.6b is incomplete, acting [24] (reducing pool backups min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.6bf is incomplete, acting [24] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.690 is incomplete, acting [24] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.5d3 is incomplete, acting [24] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
>
>
> However, that list of incomplete pgs keeps changing each time I run "ceph
> health detail | grep incomplete". For example, here is the output
> regenerated moments after I created the above:
>
> HEALTH_ERR 34 pgs incomplete; 2 pgs inconsistent; 37 pgs peering; 470 pgs
> stale; 13 pgs stuck inactive; 13 pgs stuck unclean; 4 scrub errors; 1/24 in
> osds are down; noout,nodeep-scrub flag(s) set
> pg 3.da is stuck inactive for 7977.699449, current state incomplete, last
> acting [19]
> pg 3.1a4 is stuck inactive for 6364.787502, current state incomplete, last
> acting [14]
> pg 4.c4 is stuck inactive for 8759.642771, current state incomplete, last
> acting [14]
> pg 3.4fa is stuck inactive for 8173.078486, current state incomplete, last
> acting [14]
> pg 3.372 is stuck inactive for 6706.018758, current state incomplete, last
> acting [14]
> pg 3.4ca is stuck inactive for 7121.446109, current state incomplete, last
> acting [14]
> pg 0.6 is stuck inactive for 8759.591368, current state incomplete, last
> acting [14]
> pg 3.343 is stuck inactive for 7996.560271, current state incomplete, last
> acting [14]
> pg 3.453 is stuck inactive for 6420.686656, current state incomplete, last
> acting [14]
> pg 3.4c1 is stuck inactive for 7049.443221, current state incomplete, last
> acting [14]
> pg 3.80 is stuck inactive for 7587.105164, current state incomplete, last
> acting [14]
> pg 3.4a7 is stuck inactive for 5506.691333, current state incomplete, last
> acting [14]
> pg 3.5ce is stuck inactive for 7153.943506, current state incomplete, last
> acting [14]
> pg 3.da is stuck unclean for 11816.026865, current state incomplete, last
> acting [19]
> pg 3.1a4 is stuck unclean for 8759.633093, current state incomplete, last
> acting [14]
> pg 3.4fa is stuck unclean for 8759.658848, current state incomplete, last
> acting [14]
> pg 4.c4 is stuck unclean for 8759.642866, current state incomplete, last
> acting [14]
> pg 3.372 is stuck unclean for 8759.662338, current state incomplete, last
> acting [14]
> pg 3.4ca is stuck unclean for 8759.603350, current state incomplete, last
> acting [14]
> pg 0.6 is stuck unclean for 8759.591459, current state incomplete, last
> acting [14]
> pg 3.343 is stuck unclean for 8759.645236, current state incomplete, last
> acting [14]
> pg 3.453 is stuck unclean for 8759.643875, current state incomplete, last
> acting [14]
> pg 3.4c1 is stuck unclean for 8759.606092, current state incomplete, last
> acting [14]
> pg 3.80 is stuck unclean for 8759.644522, current state incomplete, last
> acting [14]
> pg 3.4a7 is stuck unclean for 12723.462164, current state incomplete, last
> acting [14]
> pg 3.5ce is stuck unclean for 10024.882545, current state incomplete, last
> acting [14]
> pg 3.1a4 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.1a1 is incomplete, acting [14] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.138 is incomplete, acting [14] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.da is incomplete, acting [19] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.c4 is incomplete, acting [14] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.80 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.70 is incomplete, acting [19] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.76 is incomplete, acting [19] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.57 is incomplete, acting [14] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.4c is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 5.18 is incomplete, acting [19] (reducing pool backups min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 4.13 is incomplete, acting [14] (reducing pool images min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 0.6 is incomplete, acting [14] (reducing pool data min_size from 2 may
> help; search ceph.com/docs for 'incomplete')
> pg 3.7dc is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.6b4 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.692 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.5fc is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.5ce is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.4fa is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.4ca is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.4c1 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.4a7 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.460 is incomplete, acting [19] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.453 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.394 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.372 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.343 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.337 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.321 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.2c0 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.27c is incomplete, acting [19] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.27e is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.244 is incomplete, acting [14] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
> pg 3.207 is incomplete, acting [19] (reducing pool volumes min_size from 2
> may help; search ceph.com/docs for 'incomplete')
>
>
> Why would this keep changing? It seems like it must be because the OSDs
> are cycling through their crash loops and only reporting accurately from
> time to time, which makes it difficult to get an accurate view of the
> extent of the damage.
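>
> (A crude way to compare runs, in case it helps: capture each one to a
> file and then diff any two of the resulting files, e.g.
>
>   ceph health detail | grep incomplete | sort > /tmp/incomplete.$(date +%s)
>
> with the /tmp path being arbitrary.)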
>
>
> On Fri, Mar 6, 2015 at 8:30 PM, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
>> Thanks for the response. Is this the post you are referring to?
>> http://ceph.com/community/incomplete-pgs-oh-my/
>>
>> For what it's worth, this cluster was running happily for the better part
>> of a year until the event from this weekend that I described in my first
>> post, so I doubt it's a configuration issue. I suppose it could be some
>> edge-case thing that only came up just now, but that seems unlikely. Our
>> usage of this cluster has been much heavier in the past than it has been
>> recently.
>>
>> And yes, I have what looks to be about 8 pg shards on several OSDs that
>> seem to be in this state, but it's hard to say for sure. It seems like each
>> time I look at this, more problems are popping up.
>>
>> On Fri, Mar 6, 2015 at 8:19 PM, Gregory Farnum <g...@gregs42.com> wrote:
>>
>>> This might be related to the backtrace assert, but that's the problem
>>> you need to focus on. In particular, both of these errors are caused
>>> by the scrub code, which Sage suggested temporarily disabling — if
>>> you're still getting these messages, you clearly haven't done so
>>> successfully.
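>>>
>>> (For clarity, disabling scrubbing entirely would look something like
>>> setting both flags and then verifying they actually took effect:
>>>
>>>   ceph osd set noscrub
>>>   ceph osd set nodeep-scrub
>>>   ceph osd dump | grep flags
>>>
>>> The flags should also show up in the "ceph status" summary.)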
>>>
>>> That said, it looks like the problem is that the object and/or object
>>> info specified here are just totally busted. You probably want to
>>> figure out what happened there since these errors are normally a
>>> misconfiguration somewhere (e.g., setting nobarrier on fs mount and
>>> then losing power). I'm not sure if there's a good way to repair the
>>> object, but if you can lose the data I'd grab the ceph-objectstore
>>> tool and use it to remove the object from each OSD holding it. (There's
>>> a walkthrough of using it for a similar situation in a recent Ceph
>>> blog post.)
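>>>
>>> (Roughly, with the OSD daemon stopped, that removal looks something like
>>> the following, using the object from your scrub error as an example; the
>>> data and journal paths need to match the OSD in question, and the exact
>>> invocation may differ on your version, so follow the blog post:
>>>
>>>   ceph_objectstore_tool --data-path /var/lib/ceph/osd/ceph-19 \
>>>     --journal-path /var/lib/ceph/osd/ceph-19/journal \
>>>     --pgid 3.18e 'rbd_data.3f7a2ae8944a.00000000000016c8' remove
>>> )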
>>>
>>> On Fri, Mar 6, 2015 at 7:14 PM, Quentin Hartman
>>> <qhart...@direwolfdigital.com> wrote:
>>> > Alright, I tried a few suggestions for repairing this state, but I don't
>>> > seem to have any PG replicas that have good copies of the missing /
>>> > zero-length shards. What do I do now? Telling the PGs to repair doesn't
>>> > seem to help anything. I can deal with data loss if I can figure out
>>> > which images might be damaged; I just need to get the cluster consistent
>>> > enough that the things which aren't damaged are usable.
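>>> >
>>> > (One way I can think of to map a damaged object back to an image: the
>>> > rbd_data.<id> in the object name should match an image's
>>> > block_name_prefix, so something like the following ought to identify it.
>>> > The pool name is a guess based on which pool the broken PGs live in:
>>> >
>>> >   for img in $(rbd ls volumes); do
>>> >     rbd info volumes/"$img" | grep -q 3f7a2ae8944a && echo "$img"
>>> >   done
>>> > )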
>>> >
>>> > Also, I'm seeing these similar, but not quite identical, error messages
>>> > as well. I assume they are referring to the same root problem:
>>> >
>>> > -1> 2015-03-07 03:12:49.217295 7fc8ab343700  0 log [ERR] : 3.69d shard 22:
>>> > soid dd85669d/rbd_data.3f7a2ae8944a.00000000000019a5/7//3 size 0 != known
>>> > size 4194304
>>>
>>> Mmm, unfortunately that's a different object than the one referenced
>>> in the earlier crash. Maybe it's repairable, or it might be the same
>>> issue — looks like maybe you've got some widespread data loss.
>>> -Greg
>>>
>>> >
>>> >
>>> >
>>> > On Fri, Mar 6, 2015 at 7:54 PM, Quentin Hartman
>>> > <qhart...@direwolfdigital.com> wrote:
>>> >>
>>> >> Finally found an error that seems to provide some direction:
>>> >>
>>> >> -1> 2015-03-07 02:52:19.378808 7f175b1cf700  0 log [ERR] : scrub 3.18e
>>> >> e08a418e/rbd_data.3f7a2ae8944a.00000000000016c8/7//3 on disk size (0)
>>> >> does not match object info size (4120576) ajusted for ondisk to (4120576)
>>> >>
>>> >> I'm diving into google now and hoping for something useful. If anyone
>>> >> has a suggestion, I'm all ears!
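>>> >>
>>> >> (For confirming on disk, something along these lines should find the
>>> >> zero-length file, with the PG directory taken from the error above and
>>> >> the OSD path a guess at the default layout:
>>> >>
>>> >>   find /var/lib/ceph/osd/ceph-*/current/3.18e_head -size 0 -name '*3f7a2ae8944a*'
>>> >> )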
>>> >>
>>> >> QH
>>> >>
>>> >> On Fri, Mar 6, 2015 at 6:26 PM, Quentin Hartman
>>> >> <qhart...@direwolfdigital.com> wrote:
>>> >>>
>>> >>> Thanks for the suggestion, but that doesn't seem to have made a
>>> >>> difference.
>>> >>>
>>> >>> I've shut the entire cluster down and brought it back up, and my
>>> >>> config management system seems to have upgraded ceph to 0.80.8 during
>>> >>> the reboot. Everything seems to have come back up, but I am still
>>> >>> seeing the crash loops, so this is definitely something persistent,
>>> >>> probably tied to the OSD data, rather than some weird transient state.
>>> >>>
>>> >>>
>>> >>> On Fri, Mar 6, 2015 at 5:51 PM, Sage Weil <s...@newdream.net> wrote:
>>> >>>>
>>> >>>> It looks like you may be able to work around the issue for the
>>> >>>> moment with
>>> >>>>
>>> >>>>  ceph osd set nodeep-scrub
>>> >>>>
>>> >>>> as it looks like it is scrub that is getting stuck?
>>> >>>>
>>> >>>> sage
>>> >>>>
>>> >>>>
>>> >>>> On Fri, 6 Mar 2015, Quentin Hartman wrote:
>>> >>>>
>>> >>>> > Ceph health detail - http://pastebin.com/5URX9SsQ
>>> >>>> > pg dump summary (with active+clean pgs removed) -
>>> >>>> > http://pastebin.com/Y5ATvWDZ
>>> >>>> > an osd crash log (in a github gist because it was too big for
>>> >>>> > pastebin) - https://gist.github.com/qhartman/cb0e290df373d284cfb5
>>> >>>> >
>>> >>>> > And now I've got four OSDs that are looping.....
>>> >>>> >
>>> >>>> > On Fri, Mar 6, 2015 at 5:33 PM, Quentin Hartman
>>> >>>> > <qhart...@direwolfdigital.com> wrote:
>>> >>>> >       So I'm in the middle of trying to triage a problem with my
>>> >>>> >       ceph cluster running 0.80.5. I have 24 OSDs spread across 8
>>> >>>> >       machines. The cluster has been running happily for about a
>>> >>>> >       year. This last weekend, something caused the box running the
>>> >>>> >       MDS to seize hard, and when we came in on Monday, several OSDs
>>> >>>> >       were down or unresponsive. I brought the MDS and the OSDs back
>>> >>>> >       online, and managed to get things running again with minimal
>>> >>>> >       data loss. I had to mark a few objects as lost, but things
>>> >>>> >       were apparently running fine at the end of the day on Monday.
>>> >>>> > This afternoon, I noticed that one of the OSDs was apparently stuck
>>> >>>> > in a crash/restart loop, and the cluster was unhappy. Performance was
>>> >>>> > in the tank and "ceph status" was reporting all manner of problems,
>>> >>>> > as one would expect if an OSD is misbehaving. I marked the offending
>>> >>>> > OSD out, and the cluster started rebalancing as expected. However, I
>>> >>>> > noticed a short while later that another OSD had started into a
>>> >>>> > crash/restart loop. So I repeated the process, and it happened again.
>>> >>>> > At this point I noticed that there are actually two at a time in this
>>> >>>> > state.
>>> >>>> >
>>> >>>> > It's as if there's some toxic chunk of data that is getting passed
>>> >>>> > around, and when it lands on an OSD it kills it. Contrary to that
>>> >>>> > theory, however, I tried just stopping an OSD while it was in a bad
>>> >>>> > state, and once the cluster starts to try rebalancing with that OSD
>>> >>>> > down and not previously marked out, another OSD will start
>>> >>>> > crash-looping.
>>> >>>> >
>>> >>>> > I've investigated the disk of the first OSD I found with this
>>> >>>> > problem, and it has no apparent corruption on the file system.
>>> >>>> >
>>> >>>> > I'll follow up to this shortly with links to pastes of log snippets.
>>> >>>> > Any input would be appreciated. This is turning into a real cascade
>>> >>>> > failure, and I haven't any idea how to stop it.
>>> >>>> >
>>> >>>> > QH
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>> >
>>> >>>
>>> >>>
>>> >>
>>> >
>>> >
>>>
>>
>>
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
