I found a fun failure mode this weekend.

I have 6 SSDs in my 6-node Ceph cluster at home.  The SSDs are partitioned;
about half of each SSD is used for journal space for other OSDs, and the
other half holds an OSD for a cache tier.  I finally turned the cache on
late last week, and everything was great until yesterday morning, when my
whole cluster was down, hard.

Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD partitions
were 100% full.  On the 5 full machines (running Ubuntu's 3.14.1 kernel),
the cache filesystem was unreadable; any attempt to access it threw kernel
errors.  Rebooting cleared up 2 of those, leaving me with 3 of 6 devices
alive in the pool, and 3 devices with corrupt filesystems.
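For reference, target_max_bytes is set per cache pool, and setting it
safely below the raw partition size should force flushing/eviction before
the underlying filesystem ever fills.  A hypothetical sketch (the pool
name and partition size here are made up, not my actual values):

```shell
# Hypothetical example: cap a cache pool at ~80% of a 100 GiB SSD
# partition, leaving headroom so the filesystem underneath never hits
# 100% full.
PART_BYTES=$((100 * 1024 * 1024 * 1024))   # raw cache partition size
TARGET=$((PART_BYTES * 80 / 100))          # 80% watermark
echo "$TARGET"

# Then apply it to the cache pool (pool name 'hot-cache' is made up):
#   ceph osd pool set hot-cache target_max_bytes "$TARGET"
```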

Apparently btrfs really, *REALLY* doesn't like full filesystems; filling
them to 100% seems to have fatally corrupted them.  No power loss, etc.
was involved.

Trying to mount the filesystems fails, giving btrfs messages like this:

[81720.111053] BTRFS: device fsid 319cbd8a-71ac-4b42-9d5c-b02658e75cdc
devid 1 transid 61429 /dev/sde9
[81720.113074] BTRFS info (device sde9): disk space caching is enabled
[81720.188759] BTRFS: detected SSD devices, enabling SSD mode
[81720.195442] BTRFS error (device sde9): block group 36528193536 has wrong
amount of free space
[81720.195488] BTRFS error (device sde9): failed to load free space cache
for block group 36528193536
[81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
[81720.205252] BTRFS: bad tree block start 0 395247616
[81720.205622] BTRFS: bad tree block start 0 395247616
[81720.212772] BTRFS: bad tree block start 0 39714816
[81720.213152] BTRFS: bad tree block start 0 39714816
[81720.213551] BTRFS: bad tree block start 0 39714816
[81720.213925] BTRFS: bad tree block start 0 39714816
[81720.214324] BTRFS: bad tree block start 0 39714816
[81720.214697] BTRFS: bad tree block start 0 39714816
[81720.215070] BTRFS: bad tree block start 0 39714816
[81720.215441] BTRFS: bad tree block start 0 39714816
[81720.246457] BTRFS: error (device sde9) in open_ctree:2839: errno=-5 IO
failure (Failed to recover log tree)
[81720.277276] BTRFS: open_ctree failed

btrfsck wasn't helpful on the one system I tried it on, and neither was
mounting with -o ro,recovery.  I can mount the filesystems if I run
btrfs-zero-log (after dd'ing an image of the FS first), but Ceph is
unhappy:


# ceph-osd -i 9 -d
2014-06-02 08:10:49.217019 7f9873cc4800  0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 17600
starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
/var/lib/ceph/osd/ceph-9/journal
2014-06-02 08:10:49.219400 7f9873cc4800  0
filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
2014-06-02 08:10:49.232826 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
ioctl is supported and appears to work
2014-06-02 08:10:49.232838 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:10:49.247357 7f9873cc4800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:10:49.247677 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: CLONE_RANGE
ioctl is supported
2014-06-02 08:10:49.261718 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE
is supported
2014-06-02 08:10:49.262442 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:10:49.263020 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: START_SYNC
is supported (transid 71371)
2014-06-02 08:10:49.269221 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: WAIT_SYNC
is supported
2014-06-02 08:10:49.270902 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:10:49.275792 7f9873cc4800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) list_checkpoints: stat
'/var/lib/ceph/osd/ceph-9/snap_3415219' failed: (12) Cannot allocate memory
2014-06-02 08:10:49.275900 7f9873cc4800 -1
filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in
_list_snaps: (12) Cannot allocate memory
2014-06-02 08:10:49.275936 7f9873cc4800 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate memory
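
For the record, the zero-log attempt above was roughly this (done against
a dd'd copy, never the original partition; the image path is illustrative):

```shell
# Work on a copy of the damaged filesystem, not the original partition.
dd if=/dev/sde9 of=/srv/sde9.img bs=4M conv=sync,noerror
loop=$(losetup --show -f /srv/sde9.img)

# Throw away the (apparently corrupt) btrfs log tree, then try a
# read-only mount of the copy.
btrfs-zero-log "$loop"
mkdir -p /mnt/recovery
mount -o ro "$loop" /mnt/recovery
```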


Similarly, I can recover most of the data via 'btrfs restore', but Ceph has
a different failure mode:

# ceph-osd -i 16 -d
2014-06-02 08:12:41.590122 7fdfda65e800  0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5094
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
/var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:41.621624 7fdfda65e800  0
filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:41.693025 7fdfda65e800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
ioctl is supported and appears to work
2014-06-02 08:12:41.693035 7fdfda65e800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:12:41.794817 7fdfda65e800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:12:41.795263 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
CLONE_RANGE ioctl is supported
2014-06-02 08:12:42.019636 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE is supported
2014-06-02 08:12:42.020809 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:12:42.020961 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC
is supported (transid 68342)
2014-06-02 08:12:42.136140 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC
is supported
2014-06-02 08:12:42.146701 7fdfda65e800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:12:42.147929 7fdfda65e800  0
filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps
found, store may be in inconsistent state
2014-06-02 08:12:42.453012 7fdfda65e800  0
filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode:
fs, checkpoint is enabled
2014-06-02 08:12:42.484983 7fdfda65e800 -1 journal FileJournal::_open:
disabling aio for non-block journal.  Use journal_force_aio to force use of
aio anyway
2014-06-02 08:12:42.485018 7fdfda65e800  1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 0
2014-06-02 08:12:42.506080 7fdfda65e800 -1 journal FileJournal::open:
ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected
259bb594-f316-44ab-a721-8e742d8c1c18, invalid (someone else's?) journal
2014-06-02 08:12:42.506122 7fdfda65e800 -1
filestore(/var/lib/ceph/osd/ceph-16) mount failed to open journal
/var/lib/ceph/osd/ceph-16/journal: (22) Invalid argument
2014-06-02 08:12:42.506271 7fdfda65e800 -1  ** ERROR: error converting
store /var/lib/ceph/osd/ceph-16: (22) Invalid argument
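
The 'btrfs restore' pass, for reference, was along these lines (the
device and paths are illustrative, not my exact ones):

```shell
# Pull whatever file data is still reachable off the broken filesystem
# into a fresh directory; restore reads the device directly, so the FS
# doesn't need to be mountable.
mkdir -p /srv/osd-16-restore
btrfs restore -v /dev/sdX9 /srv/osd-16-restore

# Then rebuild the OSD directory on a freshly made filesystem from the
# restored tree, e.g.:
#   rsync -a /srv/osd-16-restore/ /var/lib/ceph/osd/ceph-16/
```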


Running --mkjournal (this store is just a copy, so I don't mind blowing
things away) doesn't help much:

# ceph-osd -i 16 -d
2014-06-02 08:12:52.848067 7f8d748b9800  0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5106
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
/var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:52.850669 7f8d748b9800  0
filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:52.881762 7f8d748b9800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
ioctl is supported and appears to work
2014-06-02 08:12:52.881772 7f8d748b9800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
ioctl is disabled via 'filestore fiemap' config option
2014-06-02 08:12:53.025174 7f8d748b9800  0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features:
syncfs(2) syscall fully supported (by glibc and kernel)
2014-06-02 08:12:53.025644 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
CLONE_RANGE ioctl is supported
2014-06-02 08:12:53.233272 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE is supported
2014-06-02 08:12:53.233952 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:12:53.234088 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC
is supported (transid 68347)
2014-06-02 08:12:53.341491 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC
is supported
2014-06-02 08:12:53.352080 7f8d748b9800  0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:12:53.353056 7f8d748b9800  0
filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps
found, store may be in inconsistent state
2014-06-02 08:12:53.499770 7f8d748b9800  0
filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode:
fs, checkpoint is enabled
2014-06-02 08:12:53.502813 7f8d748b9800 -1 journal FileJournal::_open:
disabling aio for non-block journal.  Use journal_force_aio to force use of
aio anyway
2014-06-02 08:12:53.502844 7f8d748b9800  1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503313 7f8d748b9800  1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503754 7f8d748b9800  1 journal close
/var/lib/ceph/osd/ceph-16/journal
Aborted (core dumped)
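
(The mkjournal step itself was just the stock invocation, nothing fancy:)

```shell
# Discard and recreate this OSD's journal file; the store here is a
# copy, so losing any in-flight journal entries is acceptable.
ceph-osd -i 16 --mkjournal
```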

Any suggestions?  I'd like to recover the ~900 objects with writeback
data still sitting on the SSDs.


Anyway, the moral of the story: don't use btrfs for your cache devices.

Lost filesystem count, after about 4 weeks and ~30 OSDs:

  xfs: 1  (power loss -> directory structure trashed)
  btrfs: 3

I'm starting to miss ext3.


Scott
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
