Hi Brad, thank you very much for the response:

> On Sep 3, 2016, at 17:05, Brad Hubbard <bhubb...@redhat.com> wrote:
> 
> 
> 
> On Sun, Sep 4, 2016 at 6:21 AM, Dan Jakubiec <dan.jakub...@gmail.com> wrote:
> 
>> 2016-09-03 16:12:44.124033 7fec728c9700 15
>> filestore(/var/lib/ceph/osd/ceph-4) read
>> 7.80_head/#7:0100377b:::1000019e202.00000000:head# 11644~524288
>> 2016-09-03 16:12:44.129766 7fec6e0c0700 -1 *** Caught signal (Aborted) **
>> in thread 7fec6e0c0700 thread_name:tp_osd_recov
> 
> Can you do a comparison of this object on all replicas (md5sum might be good).
> 
> 7.80_head/#7:0100377b:::1000019e202.00000000:head#
> 
> Its name on disk should be something like 1000019e202.00000000__head_XXX__7
> if I'm not mistaken.
> 

Looks like this PG has not yet been replicated.  It only exists on one OSD:

djakubiec@dev:~/rados$ ceph pg map 7.80
osdmap e5262 pg 7.80 (7.80) -> up [9] acting [9]
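
So there is nothing to compare it against yet.  For what it's worth, this is
roughly how I was planning to locate and checksum the on-disk file once a
replica exists (this assumes a FileStore OSD and that I have the path and
naming convention right; please correct me if not):

    # on each host that carries a copy of PG 7.80
    cd /var/lib/ceph/osd/ceph-9/current/7.80_head
    find . -name '1000019e202.00000000__head_*__7' -exec md5sum {} \;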

> Have you tried doing a deep scrub on this pg and checking the OSD logs for
> scrub errors?
> 

I just kicked off a deep-scrub on the PG and got this:

2016-09-03 18:08:54.830432 7f7d97dc1700 -1 log_channel(cluster) log [ERR] : 
7.10 deep-scrub stat mismatch, got 28730/31606 objects, 0/0 clones, 28730/31606 
dirty, 0/0 omap, 0/0 pinned, 0/0 hit_set_archive, 0/0 whiteouts, 
35508027596/38535007872 bytes, 0/0 hit_set_archive bytes.
2016-09-03 18:08:54.830446 7f7d97dc1700 -1 log_channel(cluster) log [ERR] : 
7.10 deep-scrub 1 errors
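
(For reference, I kicked the scrub off with the usual command, something along
the lines of:

    ceph pg deep-scrub <pgid>

if I'm remembering the syntax correctly.)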

Can you explain what this stat mismatch means?  What should I do about it?

> 
>>    -7> 2016-09-03 16:12:44.123884 7fec728c9700 20
>> filestore(/var/lib/ceph/osd/ceph-4) fgetattrs 132 getting '_layout'
>>    -6> 2016-09-03 16:12:44.123889 7fec728c9700 10
>> filestore(/var/lib/ceph/osd/ceph-4) getattrs no xattr exists in object_map r
>> = 0
>>    -5> 2016-09-03 16:12:44.123890 7fec728c9700 10
>> filestore(/var/lib/ceph/osd/ceph-4) getattrs
>> 7.80_head/#7:0100377b:::1000019e202.00000000:head# = 0
>>   -29> 2016-09-03 16:12:44.119228 7fec728c9700 20 list_by_hash_bitwise
>> prefix 08FE
>> 7: (ReplicatedPG::update_range(PG::BackfillInterval*,
>> ThreadPool::TPHandle&)+0x614) [0x560bbdc11ac4]
>> 8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&,
>> bool*)+0x337) [0x560bbdc31c87]
> 
> This looks messed up.
> 
> Is this how it actually looks in the logs?
> 

Yes, I cut and pasted directly.  Which part looks messed up?

Thanks,

-- Dan

> -- 
> Cheers,
> Brad
> 
>> 
>> On Sep 2, 2016, at 12:25, Samuel Just <sj...@redhat.com> wrote:
>> 
>> Probably an EIO.  You can reproduce with debug filestore = 20 to confirm.
>> -Sam
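
[Side note for anyone else hitting this: I believe the filestore debug level
can be raised at runtime with something like

    ceph tell osd.4 injectargs '--debug-filestore 20'

or persistently by adding "debug filestore = 20" to the [osd] section of
ceph.conf, though please correct me if the injectargs syntax above is off.]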
>> 
>> On Fri, Sep 2, 2016 at 10:18 AM, Reed Dier <reed.d...@focusvq.com> wrote:
>> 
>> OSD has randomly stopped for some reason. Lots of recovery processes
>> currently running on the ceph cluster. OSD log with assert below:
>> 
>> -14> 2016-09-02 11:32:38.672460 7fcf65514700  5 -- op tracker -- seq: 1147,
>> time: 2016-09-02 11:32:38.672460, event: queued_for_pg, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -13> 2016-09-02 11:32:38.672533 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672533, event: reached_pg, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -12> 2016-09-02 11:32:38.672548 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672548, event: started, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>  -11> 2016-09-02 11:32:38.672548 7fcf7cd58700  1 -- [].28:6800/27735 <==
>> mon.0 [].249:6789/0 60 ==== pg_stats_ack(0 pgs tid 45) v1 ==== 4+0+0 (0 0 0)
>> 0x55a4443b1400 con 0x55a4434a4e80
>>  -10> 2016-09-02 11:32:38.672559 7fcf70d40700  1 -- [].28:6801/27735 -->
>> [].31:6801/2070838 -- osd_sub_op(unknown.0.0:0 7.d1 MIN [scrub-unreserve] v
>> 0'0 snapset=0=[]:[]) v12 -- ?+0 0x55a443aec100 con 0x55a443be0600
>>   -9> 2016-09-02 11:32:38.672571 7fcf70d40700  5 -- op tracker -- seq:
>> 1147, time: 2016-09-02 11:32:38.672571, event: done, op:
>> osd_sub_op_reply(unknown.0.0:0 7.d1 MIN [scrub-reserve] ack, result = 0)
>>   -8> 2016-09-02 11:32:38.681929 7fcf7b555700  1 -- [].28:6801/27735 <==
>> osd.2 [].26:6801/9468 148 ==== MBackfillReserve GRANT  pgid: 15.11,
>> query_epoch: 4235 v3 ==== 30+0+0 (3067148394 0 0) 0x55a4441f65a0 con
>> 0x55a4434ab200
>>   -7> 2016-09-02 11:32:38.682009 7fcf7b555700  5 -- op tracker -- seq:
>> 1148, time: 2016-09-02 11:32:38.682008, event: done, op: MBackfillReserve
>> GRANT  pgid: 15.11, query_epoch: 4235
>>   -6> 2016-09-02 11:32:38.682068 7fcf73545700  5 osd.4 pg_epoch: 4235
>> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739
>> ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233
>> pi=4002-4232/47 (log bound mismatch
>> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0
>> active+undersized+degraded+remapped+wait_backfill] exit
>> Started/Primary/Active/WaitRemoteBackfillReserved 221.748180 6 0.000056
>>   -5> 2016-09-02 11:32:38.682109 7fcf73545700  5 osd.4 pg_epoch: 4235
>> pg[15.11( v 895'400028 (859'397021,895'400028] local-les=4234 n=166739
>> ec=732 les/c/f 4234/4003/0 4232/4233/4233) [2,4]/[4] r=0 lpr=4233
>> pi=4002-4232/47 (log bound mismatch
>> , actual=[859'396822,895'400028]) bft=2 crt=895'400028 lcod 0'0 mlcod 0'0
>> active+undersized+degraded+remapped+wait_backfill] enter
>> Started/Primary/Active/Backfilling
>>   -4> 2016-09-02 11:32:38.682584 7fcf7b555700  1 -- [].28:6801/27735 <==
>> osd.6 [].30:6801/44406 171 ==== osd pg remove(epoch 4235; pg6.19; ) v2 ====
>> 30+0+0 (522063165 0 0) 0x55a44392f680 con 0x55a443bae100
>>   -3> 2016-09-02 11:32:38.682600 7fcf7b555700  5 -- op tracker -- seq:
>> 1149, time: 2016-09-02 11:32:38.682600, event: started, op: osd pg
>> remove(epoch 4235; pg6.19; )
>>   -2> 2016-09-02 11:32:38.682616 7fcf7b555700  5 osd.4 4235
>> queue_pg_for_deletion: 6.19
>>   -1> 2016-09-02 11:32:38.685425 7fcf7b555700  5 -- op tracker -- seq:
>> 1149, time: 2016-09-02 11:32:38.685421, event: done, op: osd pg remove(epoch
>> 4235; pg6.19; )
>>    0> 2016-09-02 11:32:38.690487 7fcf6c537700 -1 osd/ReplicatedPG.cc: In
>> function 'void ReplicatedPG::scan_range(int, int, PG::BackfillInterval*,
>> ThreadPool::TPHandle&)' thread 7fcf6c537700 time 2016-09-02 11:32:38.688536
>> osd/ReplicatedPG.cc: 11345: FAILED assert(r >= 0)
>> 
>> 2016-09-02 11:32:38.711869 7fcf6c537700 -1 *** Caught signal (Aborted) **
>> 
>> in thread 7fcf6c537700 thread_name:tp_osd_recov
>> 
>> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>> 1: (()+0x8ebb02) [0x55a402375b02]
>> 2: (()+0x10330) [0x7fcfa2b51330]
>> 3: (gsignal()+0x37) [0x7fcfa0bb3c37]
>> 4: (abort()+0x148) [0x7fcfa0bb7028]
>> 5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x265) [0x55a40246cf85]
>> 6: (ReplicatedPG::scan_range(int, int, PG::BackfillInterval*,
>> ThreadPool::TPHandle&)+0xad2) [0x55a401f4f482]
>> 7: (ReplicatedPG::update_range(PG::BackfillInterval*,
>> ThreadPool::TPHandle&)+0x614) [0x55a401f4fac4]
>> 8: (ReplicatedPG::recover_backfill(int, ThreadPool::TPHandle&,
>> bool*)+0x337) [0x55a401f6fc87]
>> 9: (ReplicatedPG::start_recovery_ops(int, ThreadPool::TPHandle&,
>> int*)+0x8a0) [0x55a401fa1160]
>> 10: (OSD::do_recovery(PG*, ThreadPool::TPHandle&)+0x355) [0x55a401e31555]
>> 11: (OSD::RecoveryWQ::_process(PG*, ThreadPool::TPHandle&)+0xd)
>> [0x55a401e7a0dd]
>> 12: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa6e) [0x55a40245e18e]
>> 13: (ThreadPool::WorkThread::entry()+0x10) [0x55a40245f070]
>> 14: (()+0x8184) [0x7fcfa2b49184]
>> 15: (clone()+0x6d) [0x7fcfa0c7737d]
>> 
>> 
>> Any help with this appreciated.
>> 
>> Thanks,
>> 
>> Reed
>> 
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
