Hi All,
I've got a strange situation that hopefully someone can help with.
We have a backfill that never completes: the destination OSD of the
recovery predictably crashes. Outing the destination OSD so another OSD
takes the backfill just causes a different OSD in the cluster to crash
instead; boot, rinse and repeat.
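For what it's worth, the loop we keep going through looks roughly like the
below (osd.35 is taken from the log file name further down, and pg 0.142e is
just the pg visible in the op being handled at the time of the crash, so it
may not be the backfilling pg; the exact ids change on each pass):

    ceph pg 0.142e query   # dump the state of the pg seen in the op below
    ceph osd out 35        # mark the crashed backfill target out (osd.35 this round)
    ceph -w                # watch: the backfill remaps, then the new destination OSD aborts the same way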
The logs show:
--- begin dump of recent events ---
-2> 2019-08-02 06:26:16.133337 7ff9fadf6700 5 -- 10.1.100.22:6808/3657777
>> 10.1.100.6:0/3789781062 conn(0x55d272342000 :6808
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4238 cs=1 l=1). rx
client.352388821 seq 74 0x55d2723ad740 osd_op(client.352388821.0:46698064
0.142e 0.b4eab42e (undecoded) ondisk+write+known_if_redirected e174744) v8
-1> 2019-08-02 06:26:16.133367 7ff9fadf6700 1 -- 10.1.100.22:6808/3657777
<== client.352388821 10.1.100.6:0/3789781062 74 ====
osd_op(client.352388821.0:46698064 0.142e 0.b4eab42e (undecoded)
ondisk+write+known_if_redirected e174744) v8 ==== 248+0+16384 (881189615 0
2173568771) 0x55d2723ad740 con 0x55d272342000
0> 2019-08-02 06:26:16.185021 7ff9df594700 -1 *** Caught signal (Aborted)
**
in thread 7ff9df594700 thread_name:tp_osd_tp
ceph version 12.2.12 (39cfebf25a7011204a9876d2950e4b28aba66d11) luminous
(stable)
1: (()+0xa59c94) [0x55d25900ec94]
2: (()+0x110e0) [0x7ff9fe9a10e0]
3: (gsignal()+0xcf) [0x7ff9fd968fff]
4: (abort()+0x16a) [0x7ff9fd96a42a]
5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x28e) [0x55d2590573ee]
6: (PrimaryLogPG::on_local_recover(hobject_t const&, ObjectRecoveryInfo
const&, std::shared_ptr<ObjectContext>, bool,
ObjectStore::Transaction*)+0x1287) [0x55d258bad597]
7: (ReplicatedBackend::handle_push(pg_shard_t, PushOp const&, PushReplyOp*,
ObjectStore::Transaction*)+0x305) [0x55d258d3d6e5]
8: (ReplicatedBackend::_do_push(boost::intrusive_ptr<OpRequest>)+0x12e)
[0x55d258d3d8fe]
9: (ReplicatedBackend::_handle_message(boost::intrusive_ptr<OpRequest>)+0x2e3)
[0x55d258d4d723]
10: (PGBackend::handle_message(boost::intrusive_ptr<OpRequest>)+0x50)
[0x55d258c50ce0]
11: (PrimaryLogPG::do_request(boost::intrusive_ptr<OpRequest>&,
ThreadPool::TPHandle&)+0x4f1) [0x55d258bb44a1]
12: (OSD::dequeue_op(boost::intrusive_ptr<PG>,
boost::intrusive_ptr<OpRequest>, ThreadPool::TPHandle&)+0x3ab) [0x55d258a21dcb]
13: (PGQueueable::RunVis::operator()(boost::intrusive_ptr<OpRequest>
const&)+0x5a) [0x55d258cda97a]
14: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0x102d) [0x55d258a4fdbd]
15: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8ef)
[0x55d25905c0cf]
16: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55d25905f3d0]
17: (()+0x74a4) [0x7ff9fe9974a4]
18: (clone()+0x3f) [0x7ff9fda1ed0f]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.
--- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
1/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 1 reserver
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
4/ 5 memdb
1/ 5 kinetic
1/ 5 fuse
1/ 5 mgr
1/ 5 mgrc
1/ 5 dpdk
1/ 5 eventtrace
-2/-2 (syslog threshold)
-1/-1 (stderr threshold)
max_recent 10000
max_new 1000
log_file /var/log/ceph/ceph-osd.35.log
--- end dump of recent events ---
Any help would be very much appreciated.
All the Best
Kevin