I found a fun failure mode this weekend. I have 6 SSDs in my 6-node Ceph cluster at home. The SSDs are partitioned; about half of each SSD is used as journal space for other OSDs, and the other half holds an OSD for a cache tier. I finally turned on the cache late last week, and everything was great, until yesterday morning, when my whole cluster was down, hard.
Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD partitions were 100% full. On the 5 full machines (running Ubuntu's 3.14.1 kernel), the cache filesystem was unreadable; any attempt to access it threw kernel errors. Rebooting cleared up 2 of those, leaving me with 3 of 6 devices alive in the pool and 3 devices with corrupt filesystems. Apparently btrfs really, *REALLY* doesn't like full filesystems, because filling them 100% seems to have fatally corrupted them. No power loss, etc. involved. Trying to mount the filesystems fails, giving btrfs messages like this:

[81720.111053] BTRFS: device fsid 319cbd8a-71ac-4b42-9d5c-b02658e75cdc devid 1 transid 61429 /dev/sde9
[81720.113074] BTRFS info (device sde9): disk space caching is enabled
[81720.188759] BTRFS: detected SSD devices, enabling SSD mode
[81720.195442] BTRFS error (device sde9): block group 36528193536 has wrong amount of free space
[81720.195488] BTRFS error (device sde9): failed to load free space cache for block group 36528193536
[81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
[81720.205252] BTRFS: bad tree block start 0 395247616
[81720.205622] BTRFS: bad tree block start 0 395247616
[81720.212772] BTRFS: bad tree block start 0 39714816
[81720.213152] BTRFS: bad tree block start 0 39714816
[81720.213551] BTRFS: bad tree block start 0 39714816
[81720.213925] BTRFS: bad tree block start 0 39714816
[81720.214324] BTRFS: bad tree block start 0 39714816
[81720.214697] BTRFS: bad tree block start 0 39714816
[81720.215070] BTRFS: bad tree block start 0 39714816
[81720.215441] BTRFS: bad tree block start 0 39714816
[81720.246457] BTRFS: error (device sde9) in open_ctree:2839: errno=-5 IO failure (Failed to recover log tree)
[81720.277276] BTRFS: open_ctree failed

btrfsck wasn't helpful on the one system that I tried it on. Nor was mounting with -o ro,recovery.
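For anyone else tuning a cache tier: the knobs are per-pool, and nothing stops you from setting target_max_bytes larger than what the backing SSDs can actually hold. Roughly what I should have done, as a sketch (pool name "hot-cache" and the sizes are examples, not my actual values):

# Cap the cache pool well below the raw partition size, leaving
# headroom for btrfs metadata (example: ~40 GB cap on a 60 GB partition).
ceph osd pool set hot-cache target_max_bytes 40000000000
ceph osd pool set hot-cache target_max_objects 1000000

# Start flushing dirty objects and evicting clean ones well before
# the cap is reached, so the agent never lets the FS fill up.
ceph osd pool set hot-cache cache_target_dirty_ratio 0.4
ceph osd pool set hot-cache cache_target_full_ratio 0.8

With target_max_bytes comfortably under the real capacity, the tiering agent should evict long before btrfs ever sees a 100% full filesystem.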
I can mount the filesystems if I run btrfs-zero-log (after dd'ing a FS image), but Ceph is unhappy:

# ceph-osd -i 9 -d
2014-06-02 08:10:49.217019 7f9873cc4800 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 17600
starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9 /var/lib/ceph/osd/ceph-9/journal
2014-06-02 08:10:49.219400 7f9873cc4800 0 filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
2014-06-02 08:10:49.232826 7f9873cc4800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:10:49.232838 7f9873cc4800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:10:49.247357 7f9873cc4800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:10:49.247677 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: CLONE_RANGE ioctl is supported
2014-06-02 08:10:49.261718 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE is supported
2014-06-02 08:10:49.262442 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_DESTROY is supported
2014-06-02 08:10:49.263020 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: START_SYNC is supported (transid 71371)
2014-06-02 08:10:49.269221 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: WAIT_SYNC is supported
2014-06-02 08:10:49.270902 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE_V2 is supported
2014-06-02 08:10:49.275792 7f9873cc4800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) list_checkpoints: stat '/var/lib/ceph/osd/ceph-9/snap_3415219' failed: (12) Cannot allocate memory
2014-06-02 08:10:49.275900 7f9873cc4800 -1 filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in _list_snaps: (12) Cannot allocate memory
2014-06-02 08:10:49.275936 7f9873cc4800 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate memory

Similarly, I can recover most of the data via 'btrfs restore', but Ceph has a different failure mode:

# ceph-osd -i 16 -d
2014-06-02 08:12:41.590122 7fdfda65e800 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5094
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16 /var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:41.621624 7fdfda65e800 0 filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:41.693025 7fdfda65e800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:12:41.693035 7fdfda65e800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:12:41.794817 7fdfda65e800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:12:41.795263 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: CLONE_RANGE ioctl is supported
2014-06-02 08:12:42.019636 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_CREATE is supported
2014-06-02 08:12:42.020809 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_DESTROY is supported
2014-06-02 08:12:42.020961 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC is supported (transid 68342)
2014-06-02 08:12:42.136140 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC is supported
2014-06-02 08:12:42.146701 7fdfda65e800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_CREATE_V2 is supported
2014-06-02 08:12:42.147929 7fdfda65e800 0 filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps found, store may be in inconsistent state
2014-06-02 08:12:42.453012 7fdfda65e800 0 filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode: fs, checkpoint is enabled
2014-06-02 08:12:42.484983 7fdfda65e800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-06-02 08:12:42.485018 7fdfda65e800 1 journal _open /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:42.506080 7fdfda65e800 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected 259bb594-f316-44ab-a721-8e742d8c1c18, invalid (someone else's?) journal
2014-06-02 08:12:42.506122 7fdfda65e800 -1 filestore(/var/lib/ceph/osd/ceph-16) mount failed to open journal /var/lib/ceph/osd/ceph-16/journal: (22) Invalid argument
2014-06-02 08:12:42.506271 7fdfda65e800 -1 ** ERROR: error converting store /var/lib/ceph/osd/ceph-16: (22) Invalid argument

Running --mkjournal (it's just a copy, I don't mind blowing things away) doesn't help much:

# ceph-osd -i 16 -d
2014-06-02 08:12:52.848067 7f8d748b9800 0 ceph version 0.80.1 (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5106
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16 /var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:52.850669 7f8d748b9800 0 filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:52.881762 7f8d748b9800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:12:52.881772 7f8d748b9800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:12:53.025174 7f8d748b9800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:12:53.025644 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: CLONE_RANGE ioctl is supported
2014-06-02 08:12:53.233272 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_CREATE is supported
2014-06-02 08:12:53.233952 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_DESTROY is supported
2014-06-02 08:12:53.234088 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC is supported (transid 68347)
2014-06-02 08:12:53.341491 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC is supported
2014-06-02 08:12:53.352080 7f8d748b9800 0 btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: SNAP_CREATE_V2 is supported
2014-06-02 08:12:53.353056 7f8d748b9800 0 filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps found, store may be in inconsistent state
2014-06-02 08:12:53.499770 7f8d748b9800 0 filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode: fs, checkpoint is enabled
2014-06-02 08:12:53.502813 7f8d748b9800 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway
2014-06-02 08:12:53.502844 7f8d748b9800 1 journal _open /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503313 7f8d748b9800 1 journal _open /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503754 7f8d748b9800 1 journal close /var/lib/ceph/osd/ceph-16/journal
Aborted (core dumped)

Any suggestions? I'd like to recover the ~900 objects with writeback data still sitting on the SSDs.

Anyway, the moral of the story: don't use btrfs for your cache devices.
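For the archives, here's roughly the recovery sequence I've been trying on each corrupted partition (device and target paths are examples; everything is done against a copy so nothing is destructive):

# Image the partition first; all further surgery happens on the copy
dd if=/dev/sde9 of=/backup/sde9.img bs=4M conv=noerror,sync

# Option 1: throw away the (corrupt) btrfs log tree, losing the last
# few seconds of writes, then try a read-only mount
btrfs-zero-log /backup/sde9.img
mount -o loop,ro /backup/sde9.img /mnt/recover

# Option 2: pull files out without mounting at all
btrfs restore /backup/sde9.img /recovered/osd-16

Option 1 gets me a mountable FS but Ceph dies in _list_snaps as shown above; option 2 recovers most files but loses the btrfs snapshots and journal fsid, giving the second failure mode.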
Lost filesystem count, after about 4 weeks and ~30 OSDs:

xfs: 1 (power loss -> directory structure trashed)
btrfs: 3

I'm starting to miss ext3.

Scott
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
