Hi there,
I'm running a Ceph cluster for some libvirt VMs and a CephFS providing
/home to ~20 desktop machines. There are 4 hosts running 4 MONs, 4 MGRs,
3 MDSs (1 active, 2 standby) and 28 OSDs in total. This cluster has been
up and running since the days of Bobtail (yes, including CephFS).
With the update from 12.2.1 to 12.2.2 last Friday afternoon I restarted
MONs, MGRs and OSDs as usual. RBD is running just fine. But after trying
to restart the MDSs, they tried replaying the journal, then fell back to
standby, and the FS was in state "damaged". I finally got them working
again after doing a good portion of what's described here:
http://docs.ceph.com/docs/master/cephfs/disaster-recovery/
Now, with all clients shut down, I can start an MDS; it will replay and
become active. I can then mount CephFS on a client and access my files
and folders. But as I bring up more clients, the MDS will first report
damaged metadata (probably due to some damaged paths, which I could live
with) and then fail with an assert:
/build/ceph-12.2.2/src/mds/MDCache.cc: 258: FAILED
assert(inode_map.count(in->vino()) == 0)
I tried doing an online CephFS scrub like this:
ceph daemon mds.a scrub_path / recursive repair
This runs for a couple of hours, always finding exactly 10001 damages of
type "backtrace" and reporting that it is fixing loads of erroneously
free-marked inodes, until the MDS crashes. When I rerun the scrub after
having killed all clients and restarted the MDSs, things repeat: it
finds exactly those 10001 damages and starts fixing exactly the same
free-marked inodes all over again.
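For reference, this is how I've been inspecting the damage table the MDS
records (a sketch; "mds.0" is an assumption for my active rank, adjust
the name for your setup):

```shell
# List the damage entries recorded by the active MDS (JSON output).
# "mds.0" is a placeholder for the active rank in my setup.
ceph tell mds.0 damage ls

# Once a given entry has actually been repaired, it can be cleared by
# its id, e.g. (id placeholder, not a real value from my cluster):
# ceph tell mds.0 damage rm <damage_id>
```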
Btw. CephFS has about 3 million objects in metadata pool. Data pool is
about 30 million objects with ~2.5TB * 3 replicas.
What I tried next is keeping the MDS down and running
cephfs-data-scan scan_extents <data pool>
cephfs-data-scan scan_inodes <data pool>
cephfs-data-scan scan_links
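(For context, the way I'm launching my 6 workers follows the
--worker_n/--worker_m split from the disaster-recovery page; a sketch
with my pool name and worker count as placeholders:)

```shell
# Split scan_extents across WORKERS parallel instances using the
# --worker_n/--worker_m options from the disaster-recovery docs.
# POOL and WORKERS are placeholders for my setup.
POOL=cephfs_data
WORKERS=6
for i in $(seq 0 $((WORKERS - 1))); do
  # Drop the leading "echo" to actually launch the workers
  # (one per shell, or one per host via ssh).
  echo cephfs-data-scan scan_extents --worker_n "$i" --worker_m "$WORKERS" "$POOL"
done
```

As far as I understand, each worker only talks to RADOS, so the same
split should also work with the workers spread over several hosts.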
As this is described as taking "a very long time", it is what I
initially skipped from the disaster-recovery tips. Right now I'm still
on the first step, with 6 workers on a single host busy doing
cephfs-data-scan scan_extents. ceph -s shows client I/O of 20 kB/s
(!!!). If that's the real scan speed, this is going to take ages.
Is there any way to tell how long this is going to take? Could I speed
things up by running more workers on multiple hosts simultaneously?
Or should I abort it, as I don't actually have the problem of lost files?
Maybe running cephfs-data-scan scan_links alone would better suit my
issue, or do scan_extents/scan_inodes HAVE to be run and finished first?
I have to get this cluster up and running again as soon as possible. Any
help is highly appreciated. If there is anything I can help with, e.g.
further information, feel free to ask. I'll try to hang around on #ceph
(nick topro/topro_/topro__). FYI, I'm in the Central European Time zone
(UTC+1).
Thank you so much!
Best regards,
Tobi
--
-----------------------------------------------------------
Dipl.-Inf. (FH) Tobias Prousa
Leiter Entwicklung Datenlogger
CAETEC GmbH
Industriestr. 1
D-82140 Olching
www.caetec.de
Gesellschaft mit beschränkter Haftung
Sitz der Gesellschaft: Olching
Handelsregister: Amtsgericht München, HRB 183929
Geschäftsführung: Stephan Bacher, Andreas Wocke
Tel.: +49 (0)8142 / 50 13 60
Fax.: +49 (0)8142 / 50 13 69
eMail: [email protected]
Web: http://www.caetec.de
------------------------------------------------------------
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com