I can cope with single-FS failures, within reason.  It's the coordinated
failures across multiple servers that really freak me out.


On Mon, Jun 2, 2014 at 8:47 AM, Thorwald Lundqvist <[email protected]>
wrote:

> I'm sorry to hear about that.
>
> I'd say don't use btrfs at all; it has proven unstable for us in
> production even without a cache.  It's just not ready for production use.
>
>
> On Mon, Jun 2, 2014 at 5:20 PM, Scott Laird <[email protected]> wrote:
>
>> I found a fun failure mode this weekend.
>>
>> I have 6 SSDs in my 6-node Ceph cluster at home.  The SSDs are
>> partitioned; about half of the SSD is used for journal space for other
>> OSDs, and half holds an OSD for a cache tier.  I finally turned the cache
>> on late last week, and everything was great, until yesterday morning, when
>> my whole cluster was down, hard.
>>
>> Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD partitions
>> were 100% full.  On the 5 full machines (running Ubuntu's 3.14.1 kernel),
>> the cache filesystem was unreadable; any attempt to access it threw kernel
>> errors.  Rebooting cleared up 2 of those, leaving me with 3 of 6 devices
>> alive in the pool, and 3 devices with corrupt filesystems.
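
For what it's worth, that cap is a per-pool setting, and it has to leave
btrfs real headroom.  A minimal sketch of the sizing (the pool name, the
partition size, and the 80% figure are all made up; only the ceph
subcommands themselves are real):

```shell
# Hypothetical numbers: a 100 GiB cache partition, capped at 80%
# so btrfs never reaches 100% full.
PART_BYTES=$((100 * 1024 * 1024 * 1024))
TARGET_MAX=$((PART_BYTES * 80 / 100))
echo "$TARGET_MAX"    # -> 85899345920, the value to hand to target_max_bytes

# Apply it to the cache pool (needs a live cluster; "ssd-cache" is hypothetical):
# ceph osd pool set ssd-cache target_max_bytes "$TARGET_MAX"
# ceph osd pool get ssd-cache target_max_bytes
```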
>>
>> Apparently btrfs really, *REALLY* doesn't like full filesystems; filling
>> them to 100% seems to have fatally corrupted them.  No power loss, etc.
>> involved.
>>
>> Trying to mount the filesystems fails, giving btrfs messages like this:
>>
>> [81720.111053] BTRFS: device fsid 319cbd8a-71ac-4b42-9d5c-b02658e75cdc
>> devid 1 transid 61429 /dev/sde9
>> [81720.113074] BTRFS info (device sde9): disk space caching is enabled
>> [81720.188759] BTRFS: detected SSD devices, enabling SSD mode
>> [81720.195442] BTRFS error (device sde9): block group 36528193536 has
>> wrong amount of free space
>> [81720.195488] BTRFS error (device sde9): failed to load free space cache
>> for block group 36528193536
>> [81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
>> [81720.205252] BTRFS: bad tree block start 0 395247616
>> [81720.205622] BTRFS: bad tree block start 0 395247616
>> [81720.212772] BTRFS: bad tree block start 0 39714816
>> [81720.213152] BTRFS: bad tree block start 0 39714816
>> [81720.213551] BTRFS: bad tree block start 0 39714816
>> [81720.213925] BTRFS: bad tree block start 0 39714816
>> [81720.214324] BTRFS: bad tree block start 0 39714816
>> [81720.214697] BTRFS: bad tree block start 0 39714816
>> [81720.215070] BTRFS: bad tree block start 0 39714816
>> [81720.215441] BTRFS: bad tree block start 0 39714816
>> [81720.246457] BTRFS: error (device sde9) in open_ctree:2839: errno=-5 IO
>> failure (Failed to recover log tree)
>> [81720.277276] BTRFS: open_ctree failed
>>
>> btrfsck wasn't helpful on the one system that I tried it on.  Nor was
>> mounting with -o ro,recovery.  I can mount the filesystems if I run
>> btrfs-zero-log (after dd'ing an image of the FS first), but Ceph is unhappy:
>>
>>
>> # ceph-osd -i 9 -d
>> 2014-06-02 08:10:49.217019 7f9873cc4800  0 ceph version 0.80.1
>> (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 17600
>> starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
>> /var/lib/ceph/osd/ceph-9/journal
>> 2014-06-02 08:10:49.219400 7f9873cc4800  0
>> filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
>> 2014-06-02 08:10:49.232826 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
>> ioctl is supported and appears to work
>> 2014-06-02 08:10:49.232838 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features: FIEMAP
>> ioctl is disabled via 'filestore fiemap' config option
>> 2014-06-02 08:10:49.247357 7f9873cc4800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_features:
>> syncfs(2) syscall fully supported (by glibc and kernel)
>> 2014-06-02 08:10:49.247677 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: CLONE_RANGE
>> ioctl is supported
>> 2014-06-02 08:10:49.261718 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: SNAP_CREATE
>> is supported
>> 2014-06-02 08:10:49.262442 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
>> SNAP_DESTROY is supported
>> 2014-06-02 08:10:49.263020 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: START_SYNC
>> is supported (transid 71371)
>> 2014-06-02 08:10:49.269221 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature: WAIT_SYNC
>> is supported
>> 2014-06-02 08:10:49.270902 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
>> SNAP_CREATE_V2 is supported
>> 2014-06-02 08:10:49.275792 7f9873cc4800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) list_checkpoints: stat
>> '/var/lib/ceph/osd/ceph-9/snap_3415219' failed: (12) Cannot allocate memory
>> 2014-06-02 08:10:49.275900 7f9873cc4800 -1
>> filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in
>> _list_snaps: (12) Cannot allocate memory
>> 2014-06-02 08:10:49.275936 7f9873cc4800 -1  ** ERROR: error converting
>> store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate memory
>>
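
The work-on-a-copy approach above, spelled out.  btrfs-zero-log throws the
log tree away, which is why the dd comes first; the device name, image
path, and mount point here are all illustrative:

```shell
# Image the broken partition, then discard the (corrupt) log tree
# on the copy only, never on the original device.
dd if=/dev/sde9 of=/mnt/scratch/sde9.img bs=4M conv=sparse
LOOP=$(losetup --find --show /mnt/scratch/sde9.img)
btrfs-zero-log "$LOOP"
mount -o ro "$LOOP" /mnt/recover
```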
>>
>> Similarly, I can recover most of the data via 'btrfs restore', but Ceph
>> has a different failure mode:
>>
>> # ceph-osd -i 16 -d
>> 2014-06-02 08:12:41.590122 7fdfda65e800  0 ceph version 0.80.1
>> (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5094
>> starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
>> /var/lib/ceph/osd/ceph-16/journal
>> 2014-06-02 08:12:41.621624 7fdfda65e800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
>> 2014-06-02 08:12:41.693025 7fdfda65e800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
>> ioctl is supported and appears to work
>> 2014-06-02 08:12:41.693035 7fdfda65e800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
>> ioctl is disabled via 'filestore fiemap' config option
>> 2014-06-02 08:12:41.794817 7fdfda65e800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features:
>> syncfs(2) syscall fully supported (by glibc and kernel)
>> 2014-06-02 08:12:41.795263 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> CLONE_RANGE ioctl is supported
>> 2014-06-02 08:12:42.019636 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_CREATE is supported
>> 2014-06-02 08:12:42.020809 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_DESTROY is supported
>> 2014-06-02 08:12:42.020961 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC
>> is supported (transid 68342)
>> 2014-06-02 08:12:42.136140 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC
>> is supported
>> 2014-06-02 08:12:42.146701 7fdfda65e800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_CREATE_V2 is supported
>> 2014-06-02 08:12:42.147929 7fdfda65e800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps
>> found, store may be in inconsistent state
>> 2014-06-02 08:12:42.453012 7fdfda65e800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode:
>> fs, checkpoint is enabled
>> 2014-06-02 08:12:42.484983 7fdfda65e800 -1 journal FileJournal::_open:
>> disabling aio for non-block journal.  Use journal_force_aio to force use of
>> aio anyway
>> 2014-06-02 08:12:42.485018 7fdfda65e800  1 journal _open
>> /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
>> bytes, directio = 1, aio = 0
>> 2014-06-02 08:12:42.506080 7fdfda65e800 -1 journal FileJournal::open:
>> ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected
>> 259bb594-f316-44ab-a721-8e742d8c1c18, invalid (someone else's?) journal
>> 2014-06-02 08:12:42.506122 7fdfda65e800 -1
>> filestore(/var/lib/ceph/osd/ceph-16) mount failed to open journal
>> /var/lib/ceph/osd/ceph-16/journal: (22) Invalid argument
>> 2014-06-02 08:12:42.506271 7fdfda65e800 -1  ** ERROR: error converting
>> store /var/lib/ceph/osd/ceph-16: (22) Invalid argument
>>
>>
>> Running --mkjournal (it's just a copy, so I don't mind blowing things away)
>> doesn't help much:
>>
>> # ceph-osd -i 16 -d
>> 2014-06-02 08:12:52.848067 7f8d748b9800  0 ceph version 0.80.1
>> (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd, pid 5106
>> starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
>> /var/lib/ceph/osd/ceph-16/journal
>> 2014-06-02 08:12:52.850669 7f8d748b9800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
>> 2014-06-02 08:12:52.881762 7f8d748b9800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
>> ioctl is supported and appears to work
>> 2014-06-02 08:12:52.881772 7f8d748b9800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features: FIEMAP
>> ioctl is disabled via 'filestore fiemap' config option
>> 2014-06-02 08:12:53.025174 7f8d748b9800  0
>> genericfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_features:
>> syncfs(2) syscall fully supported (by glibc and kernel)
>> 2014-06-02 08:12:53.025644 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> CLONE_RANGE ioctl is supported
>> 2014-06-02 08:12:53.233272 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_CREATE is supported
>> 2014-06-02 08:12:53.233952 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_DESTROY is supported
>> 2014-06-02 08:12:53.234088 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: START_SYNC
>> is supported (transid 68347)
>> 2014-06-02 08:12:53.341491 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature: WAIT_SYNC
>> is supported
>> 2014-06-02 08:12:53.352080 7f8d748b9800  0
>> btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
>> SNAP_CREATE_V2 is supported
>> 2014-06-02 08:12:53.353056 7f8d748b9800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no consistent snaps
>> found, store may be in inconsistent state
>> 2014-06-02 08:12:53.499770 7f8d748b9800  0
>> filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL journal mode:
>> fs, checkpoint is enabled
>> 2014-06-02 08:12:53.502813 7f8d748b9800 -1 journal FileJournal::_open:
>> disabling aio for non-block journal.  Use journal_force_aio to force use of
>> aio anyway
>> 2014-06-02 08:12:53.502844 7f8d748b9800  1 journal _open
>> /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
>> bytes, directio = 1, aio = 0
>> 2014-06-02 08:12:53.503313 7f8d748b9800  1 journal _open
>> /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block size 4096
>> bytes, directio = 1, aio = 0
>> 2014-06-02 08:12:53.503754 7f8d748b9800  1 journal close
>> /var/lib/ceph/osd/ceph-16/journal
>> Aborted (core dumped)
>>
>> Any suggestions?  I'd like to recover the ~900 objects with writeback
>> data still sitting on the SSDs.
>>
>>
>> Anyway, the moral of the story: don't use btrfs for your cache devices.
>>
>> Lost filesystem count, after about 4 weeks and ~30 OSDs:
>>
>>   xfs: 1  (power loss -> directory structure trashed)
>>   btrfs: 3
>>
>> I'm starting to miss ext3.
>>
>>
>> Scott
>>
>> _______________________________________________
>> ceph-users mailing list
>> [email protected]
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>