I started an upgrade from 0.94.7 (hammer) to 10.2.5 (jewel) on a
production cluster that is using cache tiering. The cluster has 3
monitors, 28 storage nodes, and around 370 OSDs. The upgrade of the
monitors completed without issue. I then upgraded 2 of the storage
nodes, and after the restarts the OSDs started crashing in
hit_set_trim. Here is some of the output from the log:

2017-03-02 22:41:32.338290 7f8bfd6d7700 -1 osd/ReplicatedPG.cc: In function
'void ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned int)'
thread 7f8bfd6d7700 time 2017-03-02 22:41:32.335020
osd/ReplicatedPG.cc: 10514: FAILED assert(obc)

 ceph version 0.94.7 (d56bdf93ced6b80b07397d57e3fa68fe68304432)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0xbddac5]
 2: (ReplicatedPG::hit_set_trim(ReplicatedPG::RepGather*, unsigned
int)+0x75f) [0x87e48f]
 3: (ReplicatedPG::hit_set_persist()+0xedb) [0x87f4ab]
 4: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>&)+0xe3a) [0x8a0d1a]
 5: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x68a) [0x83be4a]
 6: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x405) [0x69a5c5]
 7: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x333) [0x69ab33]
 8: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x86f)
[0xbcd1cf]
 9: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xbcf300]
 10: (()+0x7dc5) [0x7f8c1c209dc5]
 11: (clone()+0x6d) [0x7f8c1aceaced]

It started on just one OSD and then spread to others until most of the
OSDs that are part of the cache tier were crashing. That was happening
both on the OSDs running jewel and on the ones still running hammer. In
the process of trying to sort this out, the use_gmt_hitset option was
set to true and all of the OSDs were upgraded to jewel. We still have
not been able to determine a cause or a fix.
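
(For reference, use_gmt_hitset is a per-pool flag; we set it with
something like "ceph osd pool set <cache-pool> use_gmt_hitset 1",
where <cache-pool> stands in for the cache pool's name.)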

It looks like when hit_set_trim and hit_set_remove_all are called,
they call get_hit_set_archive_object() to generate an object name
based on a timestamp and then call get_object_context(), which returns
nothing and triggers the assert.
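
To make that concrete, here is a minimal standalone model of the
failing path (my paraphrase for illustration, not the actual
ReplicatedPG.cc code; the names mirror the backtrace, and the pgid and
timestamps are made up):

    // toy model: trim walks the in-memory hit set history, derives each
    // archive object's name from its timestamps, and asserts that the
    // lookup succeeds (the stand-in for get_object_context()).
    #include <cassert>
    #include <list>
    #include <set>
    #include <string>

    struct HitSetInfo { std::string begin, end; };

    // stand-in for get_hit_set_archive_object(): the name embeds the
    // hit set's begin/end timestamps
    static std::string archive_object(const HitSetInfo &h) {
        return "hit_set_19.5d4_archive_" + h.begin + "_" + h.end;
    }

    int main() {
        // archive objects that actually exist on disk
        std::set<std::string> store = { archive_object({"A", "B"}) };
        // the in-memory history references an interval whose derived
        // name never existed on disk
        std::list<HitSetInfo> history = { {"C", "D"}, {"A", "B"} };

        unsigned max = 0;  // trim everything beyond 'max' entries
        for (unsigned num = history.size(); num > max; --num) {
            std::string oid = archive_object(history.front());
            bool obc = store.count(oid) > 0;  // get_object_context()
            assert(obc);  // the "FAILED assert(obc)" we keep hitting
            store.erase(oid);
            history.pop_front();
        }
    }

Run as-is, this aborts on the first entry, which matches what the OSDs
are doing for every hit set whose name they can no longer derive.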

I raised debug_osd to 10/10 and then analyzed the logs after the
crash.
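(For what it's worth, the level was bumped at runtime with something
like "ceph tell osd.146 injectargs '--debug-osd 10/10'"; exact syntax
from memory.) I found the following in the ceph-osd log afterwards.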

2017-03-03 03:10:31.918470 7f218c842700 10 osd.146 pg_epoch: 266043
pg[19.5d4( v 264786'61233923 (262173'61230715,264786'61233923]
local-les=266043 n=393 ec=83762 les/c/f 266043/264767/0
266042/266042/266042) [146,116,179] r=0 lpr=266042
 pi=264766-266041/431 crt=262323'61233250 lcod 0'0 mlcod 0'0
active+degraded NIBBLEWISE] get_object_context: no obc for soid
19:2ba00000:.ceph-internal::hit_set_19.5d4_archive_2017-03-03
05%3a55%3a58.459084Z_2017-03-03 05%3a56%3a58.981016Z:head and !can_create
2017-03-03 03:10:31.921064 7f2194051700 10 osd.146 266043 do_waiters --
start
2017-03-03 03:10:31.921072 7f2194051700 10 osd.146 266043 do_waiters --
finish
2017-03-03 03:10:31.921076 7f2194051700  7 osd.146 266043 handle_pg_notify
from osd.255
2017-03-03 03:10:31.921096 7f2194051700 10 osd.146 266043 do_waiters --
start
2017-03-03 03:10:31.921099 7f2194051700 10 osd.146 266043 do_waiters --
finish
2017-03-03 03:10:31.925858 7f218c041700 -1 osd/ReplicatedPG.cc: In function
'void ReplicatedPG::hit_set_remove_all()' thread 7f218c041700 time
2017-03-03 03:10:31.918201
osd/ReplicatedPG.cc: 11494: FAILED assert(obc)

 ceph version 10.2.5 (c461ee19ecbc0c5c330aca20f7392c9a00730367)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x85) [0x7f21acee9425]
 2: (ReplicatedPG::hit_set_remove_all()+0x412) [0x7f21ac9cba92]
 3: (ReplicatedPG::on_activate()+0x6dd) [0x7f21ac9f73fd]
 4: (PG::RecoveryState::Active::react(PG::AllReplicasActivated
const&)+0xac) [0x7f21ac916adc]
 5: (boost::statechart::simple_state<PG::RecoveryState::Active,
PG::RecoveryState::Primary, PG::RecoveryState::Activating,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0x179) [0x7f21ac974909]
 6: (boost::statechart::simple_state<PG::RecoveryState::Activating,
PG::RecoveryState::Active, boost::mpl::list<mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xcd) [0x7f21ac977ccd]
 7: (boost::statechart::state_machine<PG::RecoveryState::RecoveryMachine,
PG::RecoveryState::Initial, std::allocator<void>,
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
const&)+0x6b) [0x7f21ac95d9cb]
 8: (PG::handle_peering_event(std::shared_ptr<PG::CephPeeringEvt>,
PG::RecoveryCtx*)+0x1f4) [0x7f21ac924d24]
 9: (OSD::process_peering_events(std::list<PG*, std::allocator<PG*> >
const&, ThreadPool::TPHandle&)+0x259) [0x7f21ac87de99]
 10: (OSD::PeeringWQ::_process(std::list<PG*, std::allocator<PG*> > const&,
ThreadPool::TPHandle&)+0x12) [0x7f21ac8c8552]
 11: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f21aceda46e]
 12: (ThreadPool::WorkThread::entry()+0x10) [0x7f21acedb350]
 13: (()+0x7dc5) [0x7f21aac04dc5]
 14: (clone()+0x6d) [0x7f21a928f73d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

It looks like it was looking for a hit set with a timestamp of
05:55:58.459084Z, but the timestamp on the log lines is
03:10:31.921096. We have tried various settings for hit_set_count and
hit_set_period, but the OSDs are still trying to find objects for
timestamps that don't exist. I've looked through the OSD PG directories
(i.e. /var/lib/ceph/osd/ceph-##/current/<pg.id>_head) and found
hit_set objects, but with wildly different timestamps.
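
If the localtime-vs-GMT naming of hit set archive objects is what is
biting us (my understanding is that this is exactly what the
use_gmt_hitset flag is about), a small standalone demo shows how the
same interval yields two different object names depending on which
formatter named it. This is just an illustration, not Ceph source; the
pgid and name layout are made up to resemble the log line above:

    #include <ctime>
    #include <iomanip>
    #include <iostream>
    #include <sstream>
    #include <string>

    // format a timestamp the way a hit set archive name might embed
    // it, using either GMT or the local timezone
    static std::string stamp(std::time_t t, bool use_gmt) {
        std::tm tm_buf{};
        if (use_gmt)
            gmtime_r(&t, &tm_buf);     // jewel, use_gmt_hitset=true
        else
            localtime_r(&t, &tm_buf);  // old hammer behaviour
        std::ostringstream os;
        os << std::put_time(&tm_buf, "%Y-%m-%d %H:%M:%S");
        return os.str();
    }

    int main() {
        std::time_t end = std::time(nullptr);
        std::time_t begin = end - 3600;
        for (bool gmt : {false, true}) {
            std::cout << "hit_set_19.5d4_archive_" << stamp(begin, gmt)
                      << "Z_" << stamp(end, gmt) << "Z"
                      << (gmt ? "  (GMT)" : "  (localtime)") << "\n";
        }
        // on any host whose timezone is not UTC, the two names differ,
        // so a lookup by one name for an object created under the
        // other finds no obc and the assert fires
    }

On a box in a non-UTC timezone the two printed names differ by the UTC
offset, which would line up with the wildly different timestamps we
are seeing in the PG directories.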

There are a few bugs on the tracker that look like they're related,
but we haven't been able to extract a fix from them yet.
http://tracker.ceph.com/issues/13185 is the closest one, but it hasn't
given us a hint on a fix so far. http://tracker.ceph.com/issues/14399
and http://tracker.ceph.com/issues/9732 also seem related but haven't
helped either. We have verified that all nodes are using the same
timezone and have consistent timestamps. We're currently looking at
editing the hit_set_setup() code to skip the hit_set_remove_all() call
and the failing code. This will give us empty hit sets; the thought is
that we will then flush the cache tier, remove it, and add a new one
back. Still not sure if it's going to work.
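
In terms of the toy model from earlier, the change we have in mind
amounts to turning the hard assert into a skip. A sketch (again my
paraphrase for illustration, not a real patch against
ReplicatedPG.cc):

    // toy version of the tolerant lookup we are considering: skip a
    // hit set whose archive object is missing instead of asserting
    #include <iostream>
    #include <list>
    #include <set>
    #include <string>

    int main() {
        std::set<std::string> store = { "hit_set_19.5d4_archive_A_B" };
        std::list<std::string> history = {
            "hit_set_19.5d4_archive_C_D",  // name that never existed
            "hit_set_19.5d4_archive_A_B",
        };
        for (const std::string &oid : history) {
            if (!store.count(oid)) {       // was: assert(obc)
                std::cout << "no obc for " << oid << ", skipping\n";
                continue;
            }
            store.erase(oid);
            std::cout << "removed " << oid << "\n";
        }
    }

The real change would presumably leave any archive objects with stale
names behind on disk, which is part of why we want to flush and
rebuild the cache tier afterwards.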

Does anyone have any other thoughts on how to handle this, why it is
happening, or what else we could do to get the OSDs back online? This
has crashed almost all of the cache tier OSDs, and the cluster is
unusable at the moment.

Thanks in advance.
Mike