Hi there,

Regarding my ML post from yesterday ("Upgrade from 12.2.1 to 12.2.2 broke my CephFS"): I was able to get a little further with the suggested "cephfs-table-tool take_inos <max ino>". This made the whole issue with the huge number of falsely free-marked inodes go away.
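
For reference, the commands were along these lines (all MDS daemons stopped first; the exact <max ino> of course depends on your own inode table, so take this as a sketch rather than a recipe):

# dump the inode table to pick a value safely above the highest allocated inode number
cephfs-table-tool all show inode

# claim all inode numbers up to that value as used
cephfs-table-tool all take_inos <max ino>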

I then restarted the MDS and kept all clients down, so no client had the FS mounted. Then I started an online MDS scrub:

ceph daemon mds.a scrub_path / recursive repair

This again ran for about 3 hours, then the MDS again marked the FS damaged and changed its own state to standby (at least that is how I interpret what I see). This happened exactly at the moment the scrub hit a missing object. See the end of the logfile (default log level):

2017-12-11 22:29:05.725484 7fc2342bc700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x1000d3aede3(/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-unwanted-simple.sbstore), rewriting it
2017-12-11 22:29:05.725507 7fc2342bc700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x1000d3aede3 (/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-unwanted-simple.sbstore) see mds.b log and `damage ls` output for details
2017-12-11 22:29:05.725569 7fc2342bc700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1000d3aede3 [2,head] /home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-unwanted-simple.sbstore auth v382 dirtyparent s=232 n(v0 b232 1=1+0) (iversion lock) | dirtyparent=1 scrubqueue=0 0x55ef37c83200]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(0)0x1000d3aede3:[<0x1000d3aeda7/test-unwanted-simple.sbstore v382>,<0x10002de79e8/safebrowsing v7142119>,<0x10002de79df/dsjf5siv.default v4089757>,<0x10002de79de/firefox v3998050>,<0x10002de79dd/mozilla v4933047>,<0x100018bd837/.cache v115551644>,<0x10000000000/some_username v444724510>,<0x1/home v228039388>]//","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-61}
2017-12-11 22:29:05.729992 7fc2342bc700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x1000d3aedf1(/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/testexcept-flashsubdoc-simple.sbstore), rewriting it
2017-12-11 22:29:05.730022 7fc2342bc700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x1000d3aedf1 (/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/testexcept-flashsubdoc-simple.sbstore) see mds.b log and `damage ls` output for details
2017-12-11 22:29:05.730077 7fc2342bc700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1000d3aedf1 [2,head] /home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/testexcept-flashsubdoc-simple.sbstore auth v384 dirtyparent s=232 n(v0 b232 1=1+0) (iversion lock) | dirtyparent=1 scrubqueue=0 0x55ef3aa38a00]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(0)0x1000d3aedf1:[<0x1000d3aeda7/testexcept-flashsubdoc-simple.sbstore v384>,<0x10002de79e8/safebrowsing v7142119>,<0x10002de79df/dsjf5siv.default v4089757>,<0x10002de79de/firefox v3998050>,<0x10002de79dd/mozilla v4933047>,<0x100018bd837/.cache v115551644>,<0x10000000000/some_username v444724510>,<0x1/home v228039388>]//","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-61}
2017-12-11 22:29:05.733389 7fc2342bc700  0 log_channel(cluster) log [WRN] : bad backtrace on inode 0x1000d3aedb6(/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-malware-simple.cache), rewriting it
2017-12-11 22:29:05.733420 7fc2342bc700  0 log_channel(cluster) log [WRN] : Scrub error on inode 0x1000d3aedb6 (/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-malware-simple.cache) see mds.b log and `damage ls` output for details
2017-12-11 22:29:05.733475 7fc2342bc700 -1 mds.0.scrubstack _validate_inode_done scrub error on inode [inode 0x1000d3aedb6 [2,head] /home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing/test-malware-simple.cache auth v366 dirtyparent s=44 n(v0 b44 1=1+0) (iversion lock) | dirtyparent=1 scrubqueue=0 0x55ef37c78a00]: {"performed_validation":true,"passed_validation":false,"backtrace":{"checked":true,"passed":false,"read_ret_val":-61,"ondisk_value":"(-1)0x0:[]//","memoryvalue":"(0)0x1000d3aedb6:[<0x1000d3aeda7/test-malware-simple.cache v366>,<0x10002de79e8/safebrowsing v7142119>,<0x10002de79df/dsjf5siv.default v4089757>,<0x10002de79de/firefox v3998050>,<0x10002de79dd/mozilla v4933047>,<0x100018bd837/.cache v115551644>,<0x10000000000/some_username v444724510>,<0x1/home v228039388>]//","error_str":"failed to read off disk; see retval"},"raw_stats":{"checked":false,"passed":false,"read_ret_val":0,"ondisk_value.dirstat":"f()","ondisk_value.rstat":"n()","memory_value.dirrstat":"f()","memory_value.rstat":"n()","error_str":""},"return_code":-61}
2017-12-11 22:29:05.772351 7fc2342bc700  0 mds.0.cache.dir(0x1000d3ae112) _fetched missing object for [dir 0x1000d3ae112 /home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing-to_delete/ [2,head] auth v=0 cv=0/0 ap=1+0+0 state=1073741952 f() n() hs=0+0,ss=0+0 | waiter=1 authpin=1 0x55eedee27a80]
2017-12-11 22:29:05.772385 7fc2342bc700 -1 log_channel(cluster) log [ERR] : dir 0x1000d3ae112 object missing on disk; some files may be lost (/home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing-to_delete)
2017-12-11 22:29:05.778009 7fc2342bc700  1 mds.b respawn
2017-12-11 22:29:05.778028 7fc2342bc700  1 mds.b  e: '/usr/bin/ceph-mds'
2017-12-11 22:29:05.778031 7fc2342bc700  1 mds.b  0: '/usr/bin/ceph-mds'
2017-12-11 22:29:05.778036 7fc2342bc700  1 mds.b  1: '-i'
2017-12-11 22:29:05.778038 7fc2342bc700  1 mds.b  2: 'b'
2017-12-11 22:29:05.778040 7fc2342bc700  1 mds.b  3: '--pid-file'
2017-12-11 22:29:05.778042 7fc2342bc700  1 mds.b  4: '/var/run/ceph/mds.b.pid'
2017-12-11 22:29:05.778044 7fc2342bc700  1 mds.b  5: '-c'
2017-12-11 22:29:05.778046 7fc2342bc700  1 mds.b  6: '/etc/ceph/ceph.conf'
2017-12-11 22:29:05.778048 7fc2342bc700  1 mds.b  7: '--cluster'
2017-12-11 22:29:05.778050 7fc2342bc700  1 mds.b  8: 'ceph'
2017-12-11 22:29:05.778051 7fc2342bc700  1 mds.b  9: '--setuser'
2017-12-11 22:29:05.778053 7fc2342bc700  1 mds.b  10: 'ceph'
2017-12-11 22:29:05.778055 7fc2342bc700  1 mds.b  11: '--setgroup'
2017-12-11 22:29:05.778057 7fc2342bc700  1 mds.b  12: 'ceph'
2017-12-11 22:29:05.778104 7fc2342bc700  1 mds.b respawning with exe /usr/bin/ceph-mds
2017-12-11 22:29:05.778107 7fc2342bc700  1 mds.b  exe_path /proc/self/exe
2017-12-11 22:29:06.186020 7f9ad28f41c0  0 ceph version 12.2.2 (cf0baeeeeba3b47f9427c6c97e2144b094b7e5ba) luminous (stable), process (unknown), pid 3214
2017-12-11 22:29:10.604701 7f9acbb38700  1 mds.b handle_mds_map standby

As long as the MDS was still active, "damage ls" again gave me exactly 10001 damages of damage_type "backtrace". The log implies that those backtraces cannot be fixed automatically. I could live with losing those 10k files, but I do not get why the MDS switches to standby and marks the FS damaged, rendering it offline. ceph -s then reports something like: mds: cephfs-0/1/1 1:damaged 1:standby (not pasted, but typed from memory).
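
In case it helps, this is roughly how I pulled those numbers off the admin socket (daemon id "b" as in the log above; the use of jq for counting is just my environment, nothing official):

# dump the damage table of the active MDS to a file
ceph daemon mds.b damage ls > /tmp/damage.json

# count all entries, then only those of damage_type "backtrace" (both give 10001 here)
jq length /tmp/damage.json
jq '[.[] | select(.damage_type == "backtrace")] | length' /tmp/damage.json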

By the way, in the log the MDS encountered two more "object missing on disk; some files may be lost" errors much earlier during that scrub (so three in total), but the first two did not send the MDS to standby. I marked the FS repaired, restarted the MDS with MDS debug level 20 and reran a scrub on that particular path, but this time the MDS would not mark the whole FS damaged and stayed active. Does it only do so when it finds three of those damages in a row?
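
For completeness, that last test went roughly like this (the fs name "cephfs", the systemd unit name and the path are guesses/taken from my setup and the log excerpt above, so please read it as a sketch):

# clear the damaged flag on rank 0 so the MDS can go active again
ceph mds repaired cephfs:0

# restart the daemon with verbose MDS logging (debug_mds = 20 set in ceph.conf for this run)
systemctl restart ceph-mds@b

# re-scrub only the directory that triggered the respawn
ceph daemon mds.b scrub_path /home/some_username/.cache/mozilla/firefox/dsjf5siv.default/safebrowsing-to_delete recursive repair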

Is this a bug, or is there something I need to do to my cluster to get it back into a stable working condition? Again, all this began with the upgrade from 12.2.1 to 12.2.2.

Furthermore, is there a way to get rid of those "broken" files (those with bad backtraces and, even more importantly, those with missing objects)? I could live with losing certain files if that helps to get CephFS working stably again.
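
In case it clarifies what I am asking: my (possibly wrong) understanding is that a file's backtrace lives in the "parent" xattr of its first object in the data pool, and that a directory fragment is an object in the metadata pool. Would checking and cleaning up along these lines be the right direction (pool names are from my setup, the damage id would come from "damage ls")?

# does the first object of a damaged file still exist, and does it carry a backtrace?
rados -p cephfs_data stat 1000d3aede3.00000000
rados -p cephfs_data getxattr 1000d3aede3.00000000 parent

# is the directory object from the "object missing on disk" error really gone from the metadata pool?
rados -p cephfs_metadata stat 1000d3ae112.00000000

# individual entries can apparently be dropped from the damage table afterwards
ceph daemon mds.b damage rm <damage id>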

Again, any help is highly appreciated; I need to get the FS back up as soon as possible. Thank you very much!

Best regards,
Tobi



--
-----------------------------------------------------------
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger

CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de

Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke

Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69

eMail: tobias.pro...@caetec.de
Web:   http://www.caetec.de
------------------------------------------------------------
