On Wed, Jun 09, 10:29, Sage Weil wrote:
> > I'll let you know if I can trigger it reliably.
I recreated the cephfs using the same setup (7 osds, 3 mons, 3 mds),
and the problem happened again, this time while running "stress"
from two clients over the weekend.
This morning all stress processes were stuck in state D and access
to the ceph fs blocks; not even ls -l works. I rebooted one machine
that was running stress, cosd, cmds and cmon. I had to power cycle it,
as a normal reboot was unable to kill the stress processes.
On reboot, cosd crashes after a few seconds due to hitting
assert(recovering_oids.count(soid) == 0)
in start_recovery_op(). gdb output:
osd/PG.cc: In function 'void PG::start_recovery_op(const sobject_t&)':
osd/PG.cc:1833: FAILED assert(recovering_oids.count(soid) == 0)
1: (ReplicatedPG::issue_repop(ReplicatedPG::RepGather*, utime_t,
eversion_t, bool, unsigned long, eversion_t)+0x84e) [0x49c88e]
2: (ReplicatedPG::do_op(MOSDOp*)+0xa9a) [0x4a22ba]
3: (OSD::dequeue_op(PG*)+0x402) [0x4e4dc2]
4: (ThreadPool::worker()+0x1fc) [0x5ec64c]
5: (ThreadPool::WorkThread::entry()+0xd) [0x50480d]
6: (Thread::_entry_func(void*)+0x7) [0x476f57]
7: /lib/libpthread.so.0 [0x7fc9556523f7]
8: (clone()+0x6d) [0x7fc9548a7b4d]
NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.
terminate called after throwing an instance of 'ceph::FailedAssertion*'
Program received signal SIGABRT, Aborted.
(gdb) bt
#0 0x00007fc954802095 in raise () from /lib/libc.so.6
#1 0x00007fc954803af0 in abort () from /lib/libc.so.6
#2 0x00007fc9550870e4 in __gnu_cxx::__verbose_terminate_handler ()
from /usr/lib/libstdc++.so.6
#3 0x00007fc955085076 in ?? () from /usr/lib/libstdc++.so.6
#4 0x00007fc9550850a3 in std::terminate () from /usr/lib/libstdc++.so.6
#5 0x00007fc95508518a in __cxa_throw () from /usr/lib/libstdc++.so.6
#6 0x00000000005eb22f in ceph::__ceph_assert_fail (assertion=0x6211d0
"recovering_oids.count(soid) == 0", file=0x620113 "osd/PG.cc", line=1833,
func=0x622140 "void PG::start_recovery_op(const sobject_t&)") at
common/assert.cc:30
#7 0x0000000000546b68 in PG::start_recovery_op (this=0xa846d0,
soid=@0x12ceb98) at osd/PG.cc:1833
#8 0x000000000049c88e in ReplicatedPG::issue_repop (this=0xa846d0,
repop=0x1296d70, now=<value optimized out>, old_last_update=
{version = 249, epoch = 137, __pad = 0}, old_exists=true,
old_size=4194304, old_version={version = 249, epoch = 137, __pad = 0})
at osd/ReplicatedPG.cc:2280
#9 0x00000000004a22ba in ReplicatedPG::do_op (this=0xa846d0, op=<value
optimized out>) at osd/ReplicatedPG.cc:637
#10 0x00000000004e4dc2 in OSD::dequeue_op (this=0x8a9fc0, pg=0xa846d0)
at osd/OSD.cc:4456
#11 0x00000000005ec64c in ThreadPool::worker (this=0x8aa478) at
common/WorkQueue.cc:44
#12 0x000000000050480d in ThreadPool::WorkThread::entry (this=<value
optimized out>) at ./common/WorkQueue.h:113
#13 0x0000000000476f57 in Thread::_entry_func (arg=0xa63) at
./common/Thread.h:39
#14 0x00007fc9556523f7 in start_thread () from /lib/libpthread.so.0
#15 0x00007fc9548a7b4d in clone () from /lib/libc.so.6
#16 0x0000000000000000 in ?? ()
I also ran the checkpg script as you suggested, but this did not find
any corrupted pgs. The tip of the git branch this cosd was compiled
from is
214a42798b4a5cd57d09c6a13b39b17c4f616aa3 (mds: handle dup anchorclient
ACKs gracefully).
Any hints?
Andre
--
The only person who always got his work done by Friday was Robinson Crusoe
