Some folks have been interested in running OSDs on different backends so
that if one filesystem has some kind of catastrophic bug that will
affect lots of OSDs, they can still keep Ceph up and running with data
copies on the other file systems.
Whether that's a better solution than just picking the most stable FS
available for everything is debatable.
Regarding the btrfs problem: No idea unfortunately. Are you hitting a
known bug? Any feedback from the btrfs devs?
Mark
On 06/02/2014 11:25 AM, Scott Laird wrote:
I can cope with single-FS failures, within reason. It's the coordinated
failures across multiple servers that really freak me out.
On Mon, Jun 2, 2014 at 8:47 AM, Thorwald Lundqvist
<[email protected]> wrote:
I'm sorry to hear about that.
I'd say don't use btrfs at all, it has proven unstable for us in
production even without cache. It's just not ready for production use.
On Mon, Jun 2, 2014 at 5:20 PM, Scott Laird <[email protected]> wrote:
I found a fun failure mode this weekend.
I have 6 SSDs in my 6-node Ceph cluster at home. The SSDs are
partitioned; about half of the SSD is used for journal space for
other OSDs, and half holds an OSD for a cache tier. I finally
turned the cache on late last week, and everything was great,
until yesterday morning, when my whole cluster was down, hard.
Apparently, I mis-set target_max_bytes, because 5 of the 6 SSD
partitions were 100% full. On the 5 full machines (running
Ubuntu's 3.14.1 kernel), the cache filesystem was unreadable;
any attempt to access it threw kernel errors. Rebooting cleared
up 2 of those, leaving me with 3 of 6 devices alive in the pool,
and 3 devices with corrupt filesystems.
Apparently btrfs really, *REALLY* doesn't like full filesystems,
because filling them 100% full seems to have fatally corrupted
them. No power loss, etc. involved.
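For what it's worth, the lesson I take away is to size target_max_bytes well below the partition, not at it. A rough sketch of what that looks like (the pool name "ssd-cache" and the 120 GB partition size here are made up, not my actual setup):

```shell
# Assumed: 120 GB cache partition, hypothetical pool name "ssd-cache".
PART_BYTES=$((120 * 1024 * 1024 * 1024))
# Cap the cache tier at ~80% of the partition so btrfs never reaches 100% full.
TARGET_MAX_BYTES=$((PART_BYTES * 80 / 100))
# Print the command rather than run it; this needs a live cluster.
echo "ceph osd pool set ssd-cache target_max_bytes $TARGET_MAX_BYTES"
```

Note that target_max_bytes is enforced by the OSDs flushing/evicting asynchronously, so the cache can overshoot the target; hence the generous headroom.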
Trying to mount the filesystems fails, giving btrfs messages like
this:
[81720.111053] BTRFS: device fsid
319cbd8a-71ac-4b42-9d5c-b02658e75cdc devid 1 transid 61429 /dev/sde9
[81720.113074] BTRFS info (device sde9): disk space caching is
enabled
[81720.188759] BTRFS: detected SSD devices, enabling SSD mode
[81720.195442] BTRFS error (device sde9): block group
36528193536 has wrong amount of free space
[81720.195488] BTRFS error (device sde9): failed to load free
space cache for block group 36528193536
[81720.205248] btree_readpage_end_io_hook: 69 callbacks suppressed
[81720.205252] BTRFS: bad tree block start 0 395247616
[81720.205622] BTRFS: bad tree block start 0 395247616
[81720.212772] BTRFS: bad tree block start 0 39714816
[81720.213152] BTRFS: bad tree block start 0 39714816
[81720.213551] BTRFS: bad tree block start 0 39714816
[81720.213925] BTRFS: bad tree block start 0 39714816
[81720.214324] BTRFS: bad tree block start 0 39714816
[81720.214697] BTRFS: bad tree block start 0 39714816
[81720.215070] BTRFS: bad tree block start 0 39714816
[81720.215441] BTRFS: bad tree block start 0 39714816
[81720.246457] BTRFS: error (device sde9) in open_ctree:2839:
errno=-5 IO failure (Failed to recover log tree)
[81720.277276] BTRFS: open_ctree failed
btrfsck wasn't helpful on the one system that I tried it on,
and neither was mounting with -o ro,recovery. I can mount the
filesystems if I run btrfs-zero-log (after dd'ing an image of
the FS first), but Ceph is unhappy:
# ceph-osd -i 9 -d
2014-06-02 08:10:49.217019 7f9873cc4800 0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
pid 17600
starting osd.9 at :/0 osd_data /var/lib/ceph/osd/ceph-9
/var/lib/ceph/osd/ceph-9/journal
2014-06-02 08:10:49.219400 7f9873cc4800 0
filestore(/var/lib/ceph/osd/ceph-9) mount detected btrfs
2014-06-02 08:10:49.232826 7f9873cc4800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:10:49.232838 7f9873cc4800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
config option
2014-06-02 08:10:49.247357 7f9873cc4800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-9)
detect_features: syncfs(2) syscall fully supported (by glibc and
kernel)
2014-06-02 08:10:49.247677 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
CLONE_RANGE ioctl is supported
2014-06-02 08:10:49.261718 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_CREATE is supported
2014-06-02 08:10:49.262442 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:10:49.263020 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
START_SYNC is supported (transid 71371)
2014-06-02 08:10:49.269221 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
WAIT_SYNC is supported
2014-06-02 08:10:49.270902 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:10:49.275792 7f9873cc4800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-9)
list_checkpoints: stat '/var/lib/ceph/osd/ceph-9/snap_3415219'
failed: (12) Cannot allocate memory
2014-06-02 08:10:49.275900 7f9873cc4800 -1
filestore(/var/lib/ceph/osd/ceph-9) FileStore::mount : error in
_list_snaps: (12) Cannot allocate memory
2014-06-02 08:10:49.275936 7f9873cc4800 -1 ** ERROR: error
converting store /var/lib/ceph/osd/ceph-9: (12) Cannot allocate
memory
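For anyone hitting the same thing, the recovery sequence I tried boils down to the following (device name is from the log above; the image and mount paths are made up). I'm printing the commands rather than running them here, since btrfs-zero-log is destructive:

```shell
# Printed, not executed: btrfs-zero-log throws away the log tree,
# so always image the partition before touching it.
RECOVERY_STEPS='
dd if=/dev/sde9 of=/backup/sde9.img bs=1M     # image the partition before anything else
mount -o ro,recovery /dev/sde9 /mnt/recover   # recovery mount (did not help in my case)
btrfs-zero-log /dev/sde9                      # discard the log tree; the FS then mounts
btrfs restore /dev/sde9 /backup/restored/     # alternative: copy files out read-only
'
printf '%s\n' "$RECOVERY_STEPS"
```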
Similarly, I can recover most of the data via 'btrfs restore',
but Ceph has a different failure mode:
# ceph-osd -i 16 -d
2014-06-02 08:12:41.590122 7fdfda65e800 0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
pid 5094
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
/var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:41.621624 7fdfda65e800 0
filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:41.693025 7fdfda65e800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:12:41.693035 7fdfda65e800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
config option
2014-06-02 08:12:41.794817 7fdfda65e800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: syncfs(2) syscall fully supported (by glibc and
kernel)
2014-06-02 08:12:41.795263 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
CLONE_RANGE ioctl is supported
2014-06-02 08:12:42.019636 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE is supported
2014-06-02 08:12:42.020809 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:12:42.020961 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
START_SYNC is supported (transid 68342)
2014-06-02 08:12:42.136140 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
WAIT_SYNC is supported
2014-06-02 08:12:42.146701 7fdfda65e800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:12:42.147929 7fdfda65e800 0
filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no
consistent snaps found, store may be in inconsistent state
2014-06-02 08:12:42.453012 7fdfda65e800 0
filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL
journal mode: fs, checkpoint is enabled
2014-06-02 08:12:42.484983 7fdfda65e800 -1 journal
FileJournal::_open: disabling aio for non-block journal. Use
journal_force_aio to force use of aio anyway
2014-06-02 08:12:42.485018 7fdfda65e800 1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:42.506080 7fdfda65e800 -1 journal
FileJournal::open: ondisk fsid
00000000-0000-0000-0000-000000000000 doesn't match expected
259bb594-f316-44ab-a721-8e742d8c1c18, invalid (someone else's?)
journal
2014-06-02 08:12:42.506122 7fdfda65e800 -1
filestore(/var/lib/ceph/osd/ceph-16) mount failed to open
journal /var/lib/ceph/osd/ceph-16/journal: (22) Invalid argument
2014-06-02 08:12:42.506271 7fdfda65e800 -1 ** ERROR: error
converting store /var/lib/ceph/osd/ceph-16: (22) Invalid argument
Running --mkjournal (it's just a copy, I don't mind blowing
things away) doesn't help much:
# ceph-osd -i 16 -d
2014-06-02 08:12:52.848067 7f8d748b9800 0 ceph version 0.80.1
(a38fe1169b6d2ac98b427334c12d7cf81f809b74), process ceph-osd,
pid 5106
starting osd.16 at :/0 osd_data /var/lib/ceph/osd/ceph-16
/var/lib/ceph/osd/ceph-16/journal
2014-06-02 08:12:52.850669 7f8d748b9800 0
filestore(/var/lib/ceph/osd/ceph-16) mount detected btrfs
2014-06-02 08:12:52.881762 7f8d748b9800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: FIEMAP ioctl is supported and appears to work
2014-06-02 08:12:52.881772 7f8d748b9800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: FIEMAP ioctl is disabled via 'filestore fiemap'
config option
2014-06-02 08:12:53.025174 7f8d748b9800 0
genericfilestorebackend(/var/lib/ceph/osd/ceph-16)
detect_features: syncfs(2) syscall fully supported (by glibc and
kernel)
2014-06-02 08:12:53.025644 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
CLONE_RANGE ioctl is supported
2014-06-02 08:12:53.233272 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE is supported
2014-06-02 08:12:53.233952 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_DESTROY is supported
2014-06-02 08:12:53.234088 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
START_SYNC is supported (transid 68347)
2014-06-02 08:12:53.341491 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
WAIT_SYNC is supported
2014-06-02 08:12:53.352080 7f8d748b9800 0
btrfsfilestorebackend(/var/lib/ceph/osd/ceph-16) detect_feature:
SNAP_CREATE_V2 is supported
2014-06-02 08:12:53.353056 7f8d748b9800 0
filestore(/var/lib/ceph/osd/ceph-16) mount WARNING: no
consistent snaps found, store may be in inconsistent state
2014-06-02 08:12:53.499770 7f8d748b9800 0
filestore(/var/lib/ceph/osd/ceph-16) mount: enabling PARALLEL
journal mode: fs, checkpoint is enabled
2014-06-02 08:12:53.502813 7f8d748b9800 -1 journal
FileJournal::_open: disabling aio for non-block journal. Use
journal_force_aio to force use of aio anyway
2014-06-02 08:12:53.502844 7f8d748b9800 1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503313 7f8d748b9800 1 journal _open
/var/lib/ceph/osd/ceph-16/journal fd 19: 5368709120 bytes, block
size 4096 bytes, directio = 1, aio = 0
2014-06-02 08:12:53.503754 7f8d748b9800 1 journal close
/var/lib/ceph/osd/ceph-16/journal
Aborted (core dumped)
Any suggestions? I'd like to recover the ~900 objects with
writeback data left sitting on the SSDs.
Anyway, the moral of the story: don't use btrfs for your cache
devices.
Lost filesystem count, after about 4 weeks and ~30 OSDs:
xfs: 1 (power loss -> directory structure trashed)
btrfs: 3
I'm starting to miss ext3.
Scott
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com