I'm at scan_links now; I'll post an update once it has finished.
Have you reset the journal after fs recovery as suggested in the doc?

quote:

If the damaged filesystem contains dirty journal data, it may be recovered next with:

cephfs-journal-tool --rank=<original filesystem name>:0 event recover_dentries list --alternate-pool recovery
cephfs-journal-tool --rank recovery-fs:0 journal reset --force
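
If you have, the same page notes (from memory -- please double-check the doc itself) that recovered directories can carry incorrect statistics; with mds_verify_scatter and mds_debug_scatterstat left at their default of false, a forward scrub repairs them once an MDS is up:

ceph daemon mds.<id> scrub_path / recursive repair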


> On 7.10.2018, at 00:36, Alfredo Daniel Rezinovsky <alfrenov...@gmail.com> wrote:
> 
> I also did something wrong during the upgrade restart...
> 
> after rescanning with:
> 
> cephfs-data-scan scan_extents cephfs_data (with threads)
> 
> cephfs-data-scan scan_inodes cephfs_data (with threads)
> 
> cephfs-data-scan scan_links
> 
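> ("with threads" here meaning several workers in parallel; roughly, if I remember the flags right, one process per worker:
> 
> cephfs-data-scan scan_extents --worker_n 0 --worker_m 4 cephfs_data
> cephfs-data-scan scan_extents --worker_n 1 --worker_m 4 cephfs_data
> cephfs-data-scan scan_extents --worker_n 2 --worker_m 4 cephfs_data
> cephfs-data-scan scan_extents --worker_n 3 --worker_m 4 cephfs_data
> 
> and likewise for scan_inodes.)
> 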
> My MDS still crashes and won't replay. Backtrace:
>  1: (()+0x3ec320) [0x55b0e2bd2320]
>  2: (()+0x12890) [0x7fc3adce3890]
>  3: (gsignal()+0xc7) [0x7fc3acddbe97]
>  4: (abort()+0x141) [0x7fc3acddd801]
>  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x250) [0x7fc3ae3cc080]
>  6: (()+0x26c0f7) [0x7fc3ae3cc0f7]
>  7: (()+0x21eb27) [0x55b0e2a04b27]
>  8: (MDCache::journal_dirty_inode(MutationImpl*, EMetaBlob*, CInode*, snapid_t)+0xc0) [0x55b0e2a04d40]
>  9: (Locker::check_inode_max_size(CInode*, bool, unsigned long, unsigned long, utime_t)+0x91d) [0x55b0e2a6a0fd]
>  10: (RecoveryQueue::_recovered(CInode*, int, unsigned long, utime_t)+0x39f) [0x55b0e2a3ca2f]
>  11: (MDSIOContextBase::complete(int)+0x119) [0x55b0e2b54ab9]
>  12: (Filer::C_Probe::finish(int)+0xe7) [0x55b0e2bd94e7]
>  13: (Context::complete(int)+0x9) [0x55b0e28e9719]
>  14: (Finisher::finisher_thread_entry()+0x12e) [0x7fc3ae3ca4ce]
>  15: (()+0x76db) [0x7fc3adcd86db]
>  16: (clone()+0x3f) [0x7fc3acebe88f]
> 
> Did you do something else before starting the MDSs again?
> 
> On 05/10/18 21:17, Sergey Malinin wrote:
>> I ended up rescanning the entire fs using the alternate metadata pool approach described in http://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
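>> For reference, the setup steps from that page look roughly like this (quoting from memory, so verify against the doc itself; "recovery" is the alternate metadata pool name it uses):
>> 
>> ceph osd pool create recovery <pg-num>
>> ceph fs new recovery-fs recovery <original data pool> --allow-dangerous-metadata-overlay
>> cephfs-data-scan init --force-init --filesystem recovery-fs --alternate-pool recovery
>> cephfs-table-tool recovery-fs:all reset session
>> cephfs-table-tool recovery-fs:all reset snap
>> cephfs-table-tool recovery-fs:all reset inode
>> 
>> followed by the scan_extents/scan_inodes passes with --alternate-pool recovery against the original data pool, and then scan_links.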
>> The process has not completed yet because during the recovery our cluster encountered another problem with the OSDs, which I got fixed yesterday (thanks to Igor Fedotov @ SUSE).
>> The first stage (scan_extents) completed in 84 hours (120M objects in the data pool on 8 HDD OSDs across 4 hosts). The second (scan_inodes) was interrupted by the OSD failure, so I have no timing stats, but it seems to be running 2-3 times faster than the extents scan.
>> As to the root cause -- in my case I recall that during the upgrade I had forgotten to restart 3 OSDs, one of which was holding metadata pool contents, before restarting the MDS daemons. That seems to have contributed to the MDS journal corruption, because when I restarted those OSDs, the MDS was able to start up but soon failed, throwing lots of 'loaded dup inode' errors.
>> 
>> 
>>> On 6.10.2018, at 00:41, Alfredo Daniel Rezinovsky <alfrenov...@gmail.com> wrote:
>>> 
>>> Same problem...
>>> 
>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>> 2018-10-05 18:37:10.704 7f01f60a9bc0 -1 Missing object 500.0000016c
>>> Overall journal integrity: DAMAGED
>>> Objects missing:
>>>   0x16c
>>> Corrupt regions:
>>>   0x5b000000-ffffffffffffffff
>>> 
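>>> (Those two figures line up: assuming the default 4 MiB (0x400000-byte) journal objects, offset = object index * 0x400000, and 0x16c * 0x400000 = 0x5b000000 -- the corrupt region starts exactly at the missing object.)
>>> 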
>>> Just after upgrading to 13.2.2.
>>> 
>>> Did you fix it?
>>> 
>>> 
>>> On 26/09/18 13:05, Sergey Malinin wrote:
>>>> Hello,
>>>> I followed the standard upgrade procedure to go from 13.2.1 to 13.2.2.
>>>> After the upgrade the MDS cluster is down; mds rank 0 and the purge_queue journal are damaged. Resetting the purge_queue does not seem to work, as the journal still appears to be damaged.
>>>> Can anybody help?
>>>> 
>>>> mds log:
>>>> 
>>>>   -789> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.mds2 Updating MDS map to version 586 from mon.2
>>>>   -788> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map i am now mds.0.583
>>>>   -787> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 handle_mds_map state change up:rejoin --> up:active
>>>>   -786> 2018-09-26 18:42:32.527 7f70f78b1700  1 mds.0.583 recovery_done -- successful recovery!
>>>> <skip>
>>>>    -38> 2018-09-26 18:42:32.707 7f70f28a7700 -1 mds.0.purge_queue _consume: Decode error at read_pos=0x322ec6636
>>>>    -37> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 set_want_state: up:active -> down:damaged
>>>>    -36> 2018-09-26 18:42:32.707 7f70f28a7700  5 mds.beacon.mds2 _send down:damaged seq 137
>>>>    -35> 2018-09-26 18:42:32.707 7f70f28a7700 10 monclient: _send_mon_message to mon.ceph3 at mon:6789/0
>>>>    -34> 2018-09-26 18:42:32.707 7f70f28a7700  1 -- mds:6800/e4cc09cf --> mon:6789/0 -- mdsbeacon(14c72/mds2 down:damaged seq 137 v24a) v7 -- 0x563b321ad480 con 0
>>>> <skip>
>>>>     -3> 2018-09-26 18:42:32.743 7f70f98b5700  5 -- mds:6800/3838577103 >> mon:6789/0 conn(0x563b3213e000 :-1 s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=8 cs=1 l=1). rx mon.2 seq 29 0x563b321ab880 mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7
>>>>     -2> 2018-09-26 18:42:32.743 7f70f98b5700  1 -- mds:6800/3838577103 <== mon.2 mon:6789/0 29 ==== mdsbeacon(85106/mds2 down:damaged seq 311 v587) v7 ==== 129+0+0 (3296573291 0 0) 0x563b321ab880 con 0x563b3213e000
>>>>     -1> 2018-09-26 18:42:32.743 7f70f98b5700  5 mds.beacon.mds2 handle_mds_beacon down:damaged seq 311 rtt 0.038261
>>>>      0> 2018-09-26 18:42:32.743 7f70f28a7700  1 mds.mds2 respawn!
>>>> 
>>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>> Overall journal integrity: DAMAGED
>>>> Corrupt regions:
>>>>   0x322ec65d9-ffffffffffffffff
>>>> 
>>>> # cephfs-journal-tool --journal=purge_queue journal reset
>>>> old journal was 13470819801~8463
>>>> new journal start will be 13472104448 (1276184 bytes past old end)
>>>> writing journal head
>>>> done
>>>> 
>>>> # cephfs-journal-tool --journal=purge_queue journal inspect
>>>> 2018-09-26 19:00:52.848 7f3f9fa50bc0 -1 Missing object 500.00000c8c
>>>> Overall journal integrity: DAMAGED
>>>> Objects missing:
>>>>   0xc8c
>>>> Corrupt regions:
>>>>   0x323000000-ffffffffffffffff

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
