At the risk of hijacking this thread: as I said, I've run into this problem again, and have captured a log with debug_osd=20, viewable at https://www.dropbox.com/s/8zoos5hhvakcpc4/ceph-osd.3.log?dl=0 - any pointers?
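For reference, a debug log like the one above can be captured by raising the OSD's log level at runtime. This is a sketch using the standard Ceph commands; `osd.3` is assumed here from the log filename:

```shell
# Persistently raise the debug level via the centralized config
# database (available since mimic):
ceph config set osd.3 debug_osd 20

# Or inject into the running daemon without persisting across restarts:
ceph tell osd.3 injectargs '--debug_osd 20'

# After reproducing the crash, restore the default level (1/5):
ceph config set osd.3 debug_osd 1/5
```

The log itself then appears in the usual location (e.g. /var/log/ceph/ceph-osd.3.log) or via journalctl, depending on how logging is configured.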
On Tue, Jan 8, 2019 at 11:31 AM Peter Woodman <[email protected]> wrote:
>
> For the record, in the linked issue, it was thought that this might be
> due to write caching. This seems not to be the case, as it happened
> again to me with write caching disabled.
>
> On Tue, Jan 8, 2019 at 11:15 AM Sage Weil <[email protected]> wrote:
> >
> > I've seen this on luminous, but not on mimic. Can you generate a log with
> > debug osd = 20 leading up to the crash?
> >
> > Thanks!
> > sage
> >
> > On Tue, 8 Jan 2019, Paul Emmerich wrote:
> > >
> > > I've seen this before a few times but unfortunately there doesn't seem
> > > to be a good solution at the moment :(
> > >
> > > See also: http://tracker.ceph.com/issues/23145
> > >
> > > Paul
> > >
> > > --
> > > Paul Emmerich
> > >
> > > Looking for help with your Ceph cluster? Contact us at https://croit.io
> > >
> > > croit GmbH
> > > Freseniusstr. 31h
> > > 81247 München
> > > www.croit.io
> > > Tel: +49 89 1896585 90
> > >
> > > On Tue, Jan 8, 2019 at 9:37 AM David Young <[email protected]> wrote:
> > > >
> > > > Hi all,
> > > >
> > > > One of my OSD hosts recently ran into RAM contention (was swapping
> > > > heavily), and after rebooting, I'm seeing this error on random OSDs in
> > > > the cluster:
> > > >
> > > > ---
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: ceph version 13.2.4 (b10be4d44915a4d78a8e06aa31919e74927b142e) mimic (stable)
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 1: /usr/bin/ceph-osd() [0xcac700]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 2: (()+0x11390) [0x7f8fa5d0e390]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 3: (gsignal()+0x38) [0x7f8fa5241428]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 4: (abort()+0x16a) [0x7f8fa524302a]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7f8fa767c510]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 6: (()+0x2e5587) [0x7f8fa767c587]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 7: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, ObjectStore::Transaction*)+0x923) [0xbab5e3]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 8: (BlueStore::queue_transactions(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, std::vector<ObjectStore::Transaction, std::allocator<ObjectStore::Transaction> >&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x5c3) [0xbade03]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 9: (ObjectStore::queue_transaction(boost::intrusive_ptr<ObjectStore::CollectionImpl>&, ObjectStore::Transaction&&, boost::intrusive_ptr<TrackedOp>, ThreadPool::TPHandle*)+0x82) [0x79c812]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 10: (OSD::dispatch_context_transaction(PG::RecoveryCtx&, PG*, ThreadPool::TPHandle*)+0x58) [0x730ff8]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 11: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr<PGPeeringEvent>, ThreadPool::TPHandle&)+0xfe) [0x759aae]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 12: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr<PG>&, ThreadPool::TPHandle&)+0x50) [0x9c5720]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 13: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x590) [0x769760]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 14: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x476) [0x7f8fa76824f6]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 15: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x7f8fa76836b0]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 16: (()+0x76ba) [0x7f8fa5d046ba]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: 17: (clone()+0x6d) [0x7f8fa531341d]
> > > > Jan 08 03:34:36 prod1 ceph-osd[3357939]: NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.
> > > > Jan 08 03:34:36 prod1 systemd[1]: [email protected]: Main process exited, code=killed, status=6/ABRT
> > > > ---
> > > >
> > > > I've restarted all the OSDs and the mons, but I'm still encountering the
> > > > above.
> > > >
> > > > Any ideas / suggestions?
> > > >
> > > > Thanks!
> > > > D
> > > > _______________________________________________
> > > > ceph-users mailing list
> > > > [email protected]
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
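The NOTE in the quoted trace can be acted on with something like the following. This is a sketch, not a definitive procedure; the debug-symbol package name varies by distribution (e.g. ceph-osd-dbg on Ubuntu), and the frame address is taken from frame 7 of the trace above:

```shell
# Disassemble ceph-osd with source interleaved, as the crash NOTE suggests
# (useful only if debug symbols for this exact build are installed):
objdump -rdS /usr/bin/ceph-osd > ceph-osd.asm

# Or resolve a single in-binary frame address to a function/file:line
# (0xbab5e3 is the BlueStore::_txc_add_transaction frame above):
addr2line -Cfe /usr/bin/ceph-osd 0xbab5e3
```

Frames with library-range addresses (e.g. 0x7f8f...) belong to shared libraries rather than the ceph-osd binary itself, so they need to be resolved against the corresponding .so instead.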
