Hi André,

At Cephalocon 2023 last week in Amsterdam, there were two presentations by Adam and Mark that might help you.

Joachim

___________________________________
Clyso GmbH - Ceph Foundation Member

On 21.04.23 at 10:53, André Gemünd wrote:
Dear Ceph-users,

In the meantime I found this ticket, which seems to have the same assertion /
stack trace but was resolved: https://tracker.ceph.com/issues/44532

Does anyone have an idea how this could still happen in 16.2.7?

Greetings
André


----- On 17 Apr 2023 at 10:30, Andre Gemuend
<andre.gemu...@scai.fraunhofer.de> wrote:

Dear Ceph-users,

We are having trouble with a Ceph cluster after a full shutdown. A couple of OSDs
no longer start and exit with SIGABRT very quickly. With debug logs and a lot of
work (I find cephadm clusters hard to debug, by the way) we obtained the
following stack trace:

debug    -16> 2023-04-14T11:52:17.617+0000 7f10ab4d2700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
In function 'void PGLog::IndexedLog::add(const pg_log_entry_t&, bool)' thread
7f10ab4d2700 time 2023-04-14T11:52:17.614095+0000

/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.7/rpm/el8/BUILD/ceph-16.2.7/src/osd/PGLog.h:
607: FAILED ceph_assert(head.version == 0 || e.version.version > head.version)


ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)

1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x158)
[0x55b2dafc7b7e]

2: /usr/bin/ceph-osd(+0x56ad98) [0x55b2dafc7d98]

3: (bool PGLog::append_log_entries_update_missing<pg_missing_set<true>
(hobject_t const&, std::__cxx11::list<pg_log_entry_t,
mempool::pool_allocator<(mempool::pool_index_t)22, pg_log_entry_t> > const&,
bool, PGLog::IndexedLog*, pg_missing_set<true>&, PGLog::LogEntryHandler*,
DoutPrefixProvider const*)+0xc19) [0x55b2db1bb6b9]

4: (PGLog::merge_log(pg_info_t&, pg_log_t&&, pg_shard_t, pg_info_t&,
PGLog::LogEntryHandler*, bool&, bool&)+0xee2) [0x55b2db1adf22]

5: (PeeringState::merge_log(ceph::os::Transaction&, pg_info_t&, pg_log_t&&,
pg_shard_t)+0x75) [0x55b2db33c165]

6: (PeeringState::Stray::react(MLogRec const&)+0xcc) [0x55b2db37adec]

7: (boost::statechart::simple_state<PeeringState::Stray, PeeringState::Started,
boost::mpl::list<mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na,
mpl_::na>,
(boost::statechart::history_mode)0>::react_impl(boost::statechart::event_base
const&, void const*)+0xd5) [0x55b2db3a6e65]

8: (boost::statechart::state_machine<PeeringState::PeeringMachine,
PeeringState::Initial, std::allocator<boost::statechart::none>,
boost::statechart::null_exception_translator>::process_event(boost::statechart::event_base
const&)+0x5b) [0x55b2db18ef6b]

9: (PG::do_peering_event(std::shared_ptr<PGPeeringEvent>, PeeringCtx&)+0x2d1)
[0x55b2db1839e1]

10: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>,
ThreadPool::TPHandle&)+0x29c) [0x55b2db0fde5c]

11: (ceph::osd::scheduler::PGPeeringItem::run(OSD*, OSDShard*,
boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x56) [0x55b2db32d0e6]

12: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0xc28)
[0x55b2db0efd48]

13: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x5c4)
[0x55b2db7615b4]

14: (ShardedThreadPool::WorkThreadSharded::entry()+0x14) [0x55b2db764254]

15: /lib64/libpthread.so.0(+0x817f) [0x7f10cef1117f]

16: clone()

debug    -15> 2023-04-14T11:52:17.618+0000 7f10b64e8700  3 osd.70 72507
handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]

debug    -14> 2023-04-14T11:52:17.619+0000 7f10b64e8700  3 osd.70 72507
handle_osd_map epochs [72507,72507], i have 72507, src has [68212,72507]

debug    -13> 2023-04-14T11:52:17.619+0000 7f10ac4d4700  5 osd.70 pg_epoch:
72507 pg[18.7( v 64162'106 (0'0,64162'106] local-lis/les=72506/72507 n=14
ec=17104/17104 lis/c=72506/72480 les/c/f=72507/72481/0 sis=72506
pruub=9.160680771s) [70,86,41] r=0 lpr=72506 pi=[72480,72506)/1 crt=64162'106
lcod 0'0 mlcod 0'0 active+wait pruub 12.822580338s@ mbc={}] exit
Started/Primary/Active/Activating 0.011269 7 0.000114
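
For context, the assert that fires at PGLog.h:607, `head.version == 0 || e.version.version > head.version`, encodes a simple invariant: a newly appended PG log entry must carry a version strictly greater than the current log head, unless the log is still empty. The following is only a minimal illustrative sketch of that invariant, with names simplified, and is not the actual Ceph code:

#include <cassert>
#include <cstdint>
#include <vector>

// Illustrative only -- not the real Ceph source; names are simplified.
struct eversion_t {                 // (epoch, version) pair, as in "64162'106" above
    uint64_t epoch;
    uint64_t version;               // per-PG, strictly increasing counter
};

struct log_entry_t {                // stand-in for pg_log_entry_t
    eversion_t version;
};

struct indexed_log_t {              // stand-in for PGLog::IndexedLog
    eversion_t head{0, 0};          // version of the newest entry; 0 while empty
    std::vector<log_entry_t> entries;

    void add(const log_entry_t& e) {
        // The invariant behind PGLog.h:607: either the log is still empty
        // (head.version == 0) or the new entry must be strictly newer.
        assert(head.version == 0 || e.version.version > head.version);
        entries.push_back(e);
        head = e.version;
    }
};

int main() {
    indexed_log_t log;
    log.add({{64162, 105}});
    log.add({{64162, 106}});        // fine: 106 > 105
    // log.add({{64162, 106}});     // would trip the assert, as in the crash above
    return 0;
}

Given the backtrace (Stray::react(MLogRec) -> merge_log -> IndexedLog::add), my reading is that a log entry received from a peer during peering is not strictly newer than our local head, but I may be missing something.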

# ceph versions
{
    "mon": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 5
    },
    "mgr": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "osd": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 92
    },
    "mds": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "rgw": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 2
    },
    "overall": {
        "ceph version 16.2.7 (dd0603118f56ab514f133c8d2e3adfc983942503) pacific (stable)": 103
    }
}


Another thing: commands like `ceph -s`, `ceph osd tree`, `rbd ls`, etc. work,
but `ceph orch ps` (and in fact any orch command) simply hangs forever,
seemingly stuck in a futex while waiting on a socket to the mons.

If anyone has an idea how we could get these OSDs back online, I'd be very
grateful for any hints. I'm also on Slack.

Greetings
--
André Gemünd, Leiter IT / Head of IT
Fraunhofer-Institute for Algorithms and Scientific Computing
andre.gemu...@scai.fraunhofer.de
Tel: +49 2241 14-4199
/C=DE/O=Fraunhofer/OU=SCAI/OU=People/CN=Andre Gemuend
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
