Some folks have been interested in running OSDs on different backends so that if one filesystem has some kind of catastrophic bug that affects lots of OSDs, they can still keep Ceph up and running with data copies on the other filesystems.

Whether that's a better approach than just picking the most stable FS available for everything is debatable.

Regarding the btrfs problem: No idea unfortunately. Are you hitting a known bug? Any feedback from the btrfs devs?

Mark

On 06/02/2014 11:25 AM, Scott Laird wrote:
I can cope with single-FS failures, within reason.  It's the coordinated
failures across multiple servers that really freak me out.


On Mon, Jun 2, 2014 at 8:47 AM, Thorwald Lundqvist
<[email protected] <mailto:[email protected]>> wrote:

    I'm sorry to hear about that.

    I'd say don't use btrfs at all, it has proven unstable for us in
    production even without cache. It's just not ready for production use.


    On Mon, Jun 2, 2014 at 5:20 PM, Scott Laird <[email protected]
    <mailto:[email protected]>> wrote:

        I found a fun failure mode this weekend.

        I have 6 SSDs in my 6-node Ceph cluster at home.  The SSDs are
        partitioned; about half of the SSD is used for journal space for
        other OSDs, and half holds an OSD for a cache tier.  I finally
        turned the cache on late last week, and everything was great,
        until yesterday morning, when my whole cluster was down, hard.

        Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD
        partitions were 100% full.  On the 5 full machines (running
        Ubuntu's 3.14.1 kernel), the cache filesystem was unreadable;
        any attempt to access it threw kernel errors.  Rebooting cleared
        up 2 of those, leaving me with 3 of 6 devices alive in the pool,
        and 3 devices with corrupt filesystems.

        Apparently btrfs really, *REALLY* doesn't like full filesystems,
        because filling them 100% full seems to have fatally corrupted
        them.  No power loss, etc. involved.
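
        For anyone setting this up: leaving headroom below the partition
        size when computing target_max_bytes would avoid this failure
        mode. A minimal sketch, assuming a 120 GB cache partition and a
        pool named "ssd-cache" (both hypothetical, not from my setup):

```shell
# Sketch under assumptions: 120 GB cache partition, ~20% headroom so the
# cache tier can never fill the backing btrfs filesystem to 100%.
PART_BYTES=$((120 * 1024 * 1024 * 1024))
TARGET_MAX_BYTES=$((PART_BYTES * 80 / 100))
echo "$TARGET_MAX_BYTES"
# Then apply it to the (hypothetical) cache pool:
#   ceph osd pool set ssd-cache target_max_bytes "$TARGET_MAX_BYTES"
```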

        Trying to mount the filesystems fails, giving btrfs messages like
        this:

        [81720.111053] BTRFS: device fsid
        319cbd8a-71ac-4b42-9d5c-b02658e75cdc devid 1 transid 61429 /dev/sde9
        [81720.113074] BTRFS info (device sde9): disk space caching is
        enabled
        [81720.188759] BTRFS: detected SSD devices, enabling SSD mode
        [81720.195442] BTRFS error (device sde9): block group
        36528193536 has wrong amount of free space
        [81720.195488] BTRFS error (device sde9): failed to load free
        space cache for block group 36528193536
        [81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
        [81720.205252] BTRFS: bad tree block start 0 395247616
        [81720.205622] BTRFS: bad tree block start 0 395247616
        [81720.212772] BTRFS: bad tree block start 0 39714816
        [81720.213152] BTRFS: bad tree block start 0 39714816
        [81720.213551] BTRFS: bad tree block start 0 39714816
        [81720.213925] BTRFS: bad tree block start 0 39714816
        [81720.214324] BTRFS: bad tree block start 0 39714816
        [81720.214697] BTRFS: bad tree block start 0 39714816
        [81720.215070] BTRFS: bad tree block start 0 39714816
        [81720.215441] BTRFS: bad tree block start 0 39714816
        [81720.246457] BTRFS: error (device sde9) in open_ctree:2839:
        errno=-5 IO failure (Failed to recover log tree)
        [81720.277276] BTRFS: open_ctree failed

        btrfsck wasn't helpful on the one system that I tried it on,
        nor was mounting with -o ro,recovery.  I can mount the
        filesystems if I run btrfs-zero-log (after dd'ing an FS image),
        but Ceph is unhappy:


        # ceph-osd -i 9 -d
        2014-06-02 08:10:49.217019 7f9873cc4800  0 ceph version 0.80.1
        (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
        pid 17600
        starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
        /var/lib/ceph/osd/ceph-9/journal
        2014-06-02 08:10:49.219400 7f9873cc4800  0
        filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
        2014-06-02 08:10:49.232826 7f9873cc4800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
        detect_features: FIEMAP ioctl is supported and appears to work
        2014-06-02 08:10:49.232838 7f9873cc4800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
        detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
        config option
        2014-06-02 08:10:49.247357 7f9873cc4800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
        detect_features: syncfs(2) syscall fully supported (by glibc and
        kernel)
        2014-06-02 08:10:49.247677 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        CLONE_RANGE ioctl is supported
        2014-06-02 08:10:49.261718 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        SNAP_CREATE is supported
        2014-06-02 08:10:49.262442 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        SNAP_DESTROY is supported
        2014-06-02 08:10:49.263020 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        START_SYNC is supported (transid 71371)
        2014-06-02 08:10:49.269221 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        WAIT_SYNC is supported
        2014-06-02 08:10:49.270902 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
        SNAP_CREATE_V2 is supported
        2014-06-02 08:10:49.275792 7f9873cc4800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9)
        list_checkpoints: stat '/var/lib/ceph/osd/ceph-9/snap_3415219'
        failed: (12) Cannot allocate memory
        2014-06-02 08:10:49.275900 7f9873cc4800 -1
        filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in
        _list_snaps: (12) Cannot allocate memory
        2014-06-02 08:10:49.275936 7f9873cc4800 -1  ** ERROR: error
        converting store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate
        memory
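
        For reference, the image-then-zero-log sequence above amounts to
        roughly the following. This is a dry-run sketch; the device and
        scratch paths are placeholders, and you'd drop the echo to run
        it for real:

```shell
# Dry-run sketch of the recovery attempt described above; /dev/sde9 and
# the scratch paths are placeholders. Remove "echo" to execute for real.
run() { echo "+ $*"; }
run dd if=/dev/sde9 of=/scratch/sde9.img bs=4M conv=sparse
run btrfs-zero-log /scratch/sde9.img   # discards the unreplayable log tree
run mount -o loop /scratch/sde9.img /mnt/recover
```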


        Similarly, I can recover most of the data via 'btrfs restore',
        but Ceph has a different failure mode:

        # ceph-osd -i 16 -d
        2014-06-02 08:12:41.590122 7fdfda65e800  0 ceph version 0.80.1
        (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
        pid 5094
        starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
        /var/lib/ceph/osd/ceph-16/journal
        2014-06-02 08:12:41.621624 7fdfda65e800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
        2014-06-02 08:12:41.693025 7fdfda65e800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: FIEMAP ioctl is supported and appears to work
        2014-06-02 08:12:41.693035 7fdfda65e800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
        config option
        2014-06-02 08:12:41.794817 7fdfda65e800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: syncfs(2) syscall fully supported (by glibc and
        kernel)
        2014-06-02 08:12:41.795263 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        CLONE_RANGE ioctl is supported
        2014-06-02 08:12:42.019636 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_CREATE is supported
        2014-06-02 08:12:42.020809 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_DESTROY is supported
        2014-06-02 08:12:42.020961 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        START_SYNC is supported (transid 68342)
        2014-06-02 08:12:42.136140 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        WAIT_SYNC is supported
        2014-06-02 08:12:42.146701 7fdfda65e800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_CREATE_V2 is supported
        2014-06-02 08:12:42.147929 7fdfda65e800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no
        consistent snaps found, store may be in inconsistent state
        2014-06-02 08:12:42.453012 7fdfda65e800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL
        journal mode: fs, checkpoint is enabled
        2014-06-02 08:12:42.484983 7fdfda65e800 -1 journal
        FileJournal::_open: disabling aio for non-block journal.  Use
        journal_force_aio to force use of aio anyway
        2014-06-02 08:12:42.485018 7fdfda65e800  1 journal _open
        /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
        size 4096 bytes, directio = 1, aio = 0
        2014-06-02 08:12:42.506080 7fdfda65e800 -1 journal
        FileJournal::open: ondisk fsid
        00000000-0000-0000-0000-000000000000 doesn't match expected
        259bb594-f316-44ab-a721-8e742d8c1c18, invalid (someone else's?)
        journal
        2014-06-02 08:12:42.506122 7fdfda65e800 -1
        filestore(/var/lib/ceph/osd/ceph-16) mount failed to open
        journal /var/lib/ceph/osd/ceph-16/journal: (22) Invalid argument
        2014-06-02 08:12:42.506271 7fdfda65e800 -1  ** ERROR: error
        converting store /var/lib/ceph/osd/ceph-16: (22) Invalid argument


        Running --mkjournal (it's just a copy, I don't mind blowing
        things away) doesn't help much:

        # ceph-osd -i 16 -d
        2014-06-02 08:12:52.848067 7f8d748b9800  0 ceph version 0.80.1
        (a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
        pid 5106
        starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
        /var/lib/ceph/osd/ceph-16/journal
        2014-06-02 08:12:52.850669 7f8d748b9800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
        2014-06-02 08:12:52.881762 7f8d748b9800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: FIEMAP ioctl is supported and appears to work
        2014-06-02 08:12:52.881772 7f8d748b9800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
        config option
        2014-06-02 08:12:53.025174 7f8d748b9800  0
        genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
        detect_features: syncfs(2) syscall fully supported (by glibc and
        kernel)
        2014-06-02 08:12:53.025644 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        CLONE_RANGE ioctl is supported
        2014-06-02 08:12:53.233272 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_CREATE is supported
        2014-06-02 08:12:53.233952 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_DESTROY is supported
        2014-06-02 08:12:53.234088 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        START_SYNC is supported (transid 68347)
        2014-06-02 08:12:53.341491 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        WAIT_SYNC is supported
        2014-06-02 08:12:53.352080 7f8d748b9800  0
        btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
        SNAP_CREATE_V2 is supported
        2014-06-02 08:12:53.353056 7f8d748b9800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no
        consistent snaps found, store may be in inconsistent state
        2014-06-02 08:12:53.499770 7f8d748b9800  0
        filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL
        journal mode: fs, checkpoint is enabled
        2014-06-02 08:12:53.502813 7f8d748b9800 -1 journal
        FileJournal::_open: disabling aio for non-block journal.  Use
        journal_force_aio to force use of aio anyway
        2014-06-02 08:12:53.502844 7f8d748b9800  1 journal _open
        /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
        size 4096 bytes, directio = 1, aio = 0
        2014-06-02 08:12:53.503313 7f8d748b9800  1 journal _open
        /var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
        size 4096 bytes, directio = 1, aio = 0
        2014-06-02 08:12:53.503754 7f8d748b9800  1 journal close
        /var/lib/ceph/osd/ceph-16/journal
        Aborted (core dumped)
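
        For completeness, the restore-and-rebuild attempt above boils
        down to the following dry-run sketch (device and destination
        paths are placeholders; OSD id 16 as above):

```shell
# Dry-run sketch; /dev/sde9 and the OSD data dir are placeholders.
# Remove "echo" to execute for real.
run() { echo "+ $*"; }
run btrfs restore /dev/sde9 /var/lib/ceph/osd/ceph-16
run ceph-osd -i 16 --mkjournal   # rebuild the journal, then retry the mount
run ceph-osd -i 16 -d
```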

        Any suggestions?  I'd like to recover the ~900 objects with
        writeback data left sitting on the SSDs.


        Anyway, the moral of the story: don't use btrfs for your cache
        devices.

        Lost filesystem count, after about 4 weeks and ~30 OSDs:

           xfs: 1  (power loss -> directory structure trashed)
           btrfs: 3

        I'm starting to miss ext3.


        Scott

        _______________________________________________
        ceph-users mailing list
        [email protected] <mailto:[email protected]>
        http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




