Re: [ceph-users] Ceph OSDs with bcache experience
Why did you guys go with partitioning the SSD for ceph journals, instead of just using the whole SSD for bcache and leaving the journal on the filesystem (which itself is on top of bcache)? Was there really a benefit to separating the journals from the bcache-fronted HDDs? I ask because it has been shown in the past that separating the journal on SSD-based pools doesn't really do much.

Michal Kozanecki | Linux Administrator | mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido den Hollander
Sent: October-28-15 5:49 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph OSDs with bcache experience

On 21-10-15 15:30, Mark Nelson wrote:
> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> In the "newstore direction" thread on ceph-devel I wrote that I'm
>>>> using bcache in production and Mark Nelson asked me to share some details.
>>>>
>>>> Bcache is running in two clusters now that I manage, but I'll keep
>>>> this information to one of them (the one at PCextreme behind CloudStack).
>>>>
>>>> This cluster has been running for over 2 years now:
>>>>
>>>> epoch 284353
>>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>>> created 2013-09-23 11:06:11.819520
>>>> modified 2015-10-20 15:27:48.734213
>>>>
>>>> The system consists of 39 hosts:
>>>>
>>>> 2U SuperMicro chassis:
>>>> * 80GB Intel SSD for OS
>>>> * 240GB Intel S3700 SSD for journaling + bcache
>>>> * 6x 3TB disk
>>>>
>>>> This isn't the newest hardware. The next batch of hardware will be
>>>> more disks per chassis, but this is it for now.
>>>>
>>>> All systems were installed with Ubuntu 12.04, but they are all
>>>> running 14.04 now with bcache.
>>>>
>>>> The Intel S3700 SSD is partitioned with a GPT label:
>>>> - 5GB journal for each OSD
>>>> - 200GB partition for bcache
>>>>
>>>> root@ceph11:~# df -h|grep osd
>>>> /dev/bcache0  2.8T  1.1T  1.8T  38%  /var/lib/ceph/osd/ceph-60
>>>> /dev/bcache1  2.8T  1.2T  1.7T  41%  /var/lib/ceph/osd/ceph-61
>>>> /dev/bcache2  2.8T  930G  1.9T  34%  /var/lib/ceph/osd/ceph-62
>>>> /dev/bcache3  2.8T  970G  1.8T  35%  /var/lib/ceph/osd/ceph-63
>>>> /dev/bcache4  2.8T  814G  2.0T  30%  /var/lib/ceph/osd/ceph-64
>>>> /dev/bcache5  2.8T  915G  1.9T  33%  /var/lib/ceph/osd/ceph-65
>>>> root@ceph11:~#
>>>>
>>>> root@ceph11:~# lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID: Ubuntu
>>>> Description:    Ubuntu 14.04.3 LTS
>>>> Release:        14.04
>>>> Codename:       trusty
>>>> root@ceph11:~# uname -r
>>>> 3.19.0-30-generic
>>>> root@ceph11:~#
>>>>
>>>> "apply_latency": {
>>>>     "avgcount": 2985023,
>>>>     "sum": 226219.891559000
>>>> }
>>>>
>>>> What did we notice?
>>>> - Fewer spikes on the disk
>>>> - Lower commit latencies on the OSDs
>>>> - Almost no 'slow requests' during backfills
>>>> - Cache-hit ratio of about 60%
>>>>
>>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>>
>>>> For the next generation hardware we are looking into using 3U
>>>> chassis with 16 4TB SATA drives and a 1.2TB NVMe SSD for bcache,
>>>> but we haven't tested those yet, so nothing to say about it.
>>>>
>>>> The current setup is 200GB of cache for 18TB of disks. The new
>>>> setup will be 1200GB for 64TB; curious to see what that does.
>>>>
>>>> Our main conclusion however is that it does smooth the I/O pattern
>>>> towards the disks, and that gives an overall better response from
>>>> the disks.
>>>
>>> Hi Wido, thanks for the big writeup!
>>> Did you guys happen to do any benchmarking? I think Xiaoxi looked
>>> at flashcache a while back but had mixed results if I remember
>>> right. It would be interesting to know how bcache is affecting
>>> performance in different scenarios.
>>
>> No, we didn't do any benchmarking. Initially this clu
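For anyone wanting to reproduce the layout Wido describes, it comes down to something like the following sketch. Device names, partition numbers and sizes are illustrative, not taken from his hosts:

  # SSD = /dev/sda (S3700), one of the six data disks = /dev/sdb
  # carve the SSD: a 5GB journal per OSD plus a ~200GB bcache partition
  sgdisk -n 1:0:+5G /dev/sda        # journal for the first OSD; repeat per disk
  sgdisk -n 7:0:+200G /dev/sda      # shared bcache cache partition

  # create the cache set on the SSD and a backing device per HDD
  make-bcache -C /dev/sda7
  make-bcache -B /dev/sdb

  # attach the backing device to the cache set
  cset=$(bcache-super-show /dev/sda7 | awk '/cset.uuid/ {print $2}')
  echo "$cset" > /sys/block/bcache0/bcache/attach

  # the OSD filesystem then goes on /dev/bcache0, with its journal
  # pointed at the raw SSD partition, e.g. /dev/sda1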
Re: [ceph-users] Having trouble getting good performance
Quick correction/clarification about ZFS and large blocks - ZFS can and will write in 1MB or larger blocks, but only in the latest versions with large block support enabled (which I am not sure ZoL has); by default block aggregation is limited to 128KB. The rest of my post (about multiple vdevs, slog, etc.) stands.

https://reviews.csiden.org/r/51/
https://www.illumos.org/issues/5027

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki
Sent: April-24-15 5:03 PM
To: J David; Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

The ZFS recordsize does NOT equal the size of the write to disk; ZFS will write to disk whatever size it feels is optimal. During a sequential write ZFS will easily write in 1MB blocks or greater. In a spinning-rust Ceph setup like yours, getting the most out of it will require higher IO depths. In this case increasing the number of vdevs ZFS sees might help: instead of a single vdev on top of a single monolithic 32TB rbd volume, how about a striped ZFS setup with 8 vdevs on top of 8 smaller 4TB rbd volumes?

Also, what sort of SSD are you using for your ZIL/SLOG? Just like there are many bad SSDs for a Ceph journal, many of the same performance guidelines apply to the ZIL/SLOG as well.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: April-24-15 1:41 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk n...@fisk.me.uk wrote:
> 7.2k drives tend to do about 80 iops at 4kb IO sizes; as the IO size
> increases the number of iops will start to fall. You will probably get
> around 70 iops for 128kb. But please benchmark your raw disks to get
> some accurate numbers if needed.
>
> Next, when you use on-disk journals you write 1st to the journal and
> then write the actual data. There is also a small levelDB write which
> stores ceph metadata, so depending on IO size you will get slightly
> less than half the native disk performance. You then have 2 copies; as
> Ceph won't ACK until both copies have been written, the average latency
> will tend to stray upwards.

What is the purpose of the journal if Ceph waits for the actual write to complete anyway? I.e. with a hardware RAID card with a BBU, the RAID card tells the host that the data is guaranteed safe as soon as it has been written to the BBU.

Does this also mean that all the writing internal to ceph happens synchronously? I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write -> copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware, shouldn't those operations be able to proceed more or less independently? And shouldn't the client be done as soon as the journal is written? I.e.:

copy1-journal-write -v-> copy1-data-write
copy2-journal-write -|-> copy2-data-write
                     +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not four? Plus all the non-trivial overhead for scheduling, LevelDB, network latency, etc.
For the "getting jackhammered by zillions of clients" case, your estimate probably holds more true, because even if writes aren't in the critical path they still happen, and sooner or later the drive runs out of IOPs and things start getting in each other's way. But for a single-client, single-thread case where the cluster is *not* 100% utilized, shouldn't the effective latency be much less?

The other thing about this that I don't quite understand, and the thing that initially had me questioning whether there was something wrong on the Ceph side, is that your estimate is based primarily on the mechanical capabilities of the drives. Yet, in practice, when the Ceph cluster is tapped out for I/O in this situation, iostat says none of the physical drives are more than 10-20% busy and doing 10-20 IOPs to write a couple of MB/sec. And those are the loaded ones at any given time. Many are 10%.

In fact, *none* of the hardware on the Ceph side is anywhere close to fully utilized. If the performance of this cluster is limited by its hardware, shouldn't there be some evidence of that somewhere?

To illustrate, I marked a physical drive out and waited for things to settle down, then ran fio on the physical drive (128KB randwrite, numjobs=1, iodepth=1). It yields a very different picture of the drive's physical limits.

The drive during maxed-out client writes:

Device: rrqm/s  wrqm/s  r/s  w/s  rkB/s  wkB/s  avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl     0.00
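Michal's 8-vdev suggestion above would look roughly like the sketch below. Image and pool names are made up, and it assumes the kernel rbd client:

  # eight 4TB images instead of one 32TB monolith (rbd sizes are in MB here)
  for i in 0 1 2 3 4 5 6 7; do
      rbd create rbd/zvol$i --size 4194304
      rbd map rbd/zvol$i
  done

  # a plain striped pool across the eight mapped devices gives ZFS eight
  # top-level vdevs, i.e. more concurrent I/O streams in flight
  zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3 \
                    /dev/rbd4 /dev/rbd5 /dev/rbd6 /dev/rbd7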
Re: [ceph-users] Having trouble getting good performance
The ZFS recordsize does NOT equal the size of the write to disk; ZFS will write to disk whatever size it feels is optimal. During a sequential write ZFS will easily write in 1MB blocks or greater. In a spinning-rust Ceph setup like yours, getting the most out of it will require higher IO depths. In this case increasing the number of vdevs ZFS sees might help: instead of a single vdev on top of a single monolithic 32TB rbd volume, how about a striped ZFS setup with 8 vdevs on top of 8 smaller 4TB rbd volumes?

Also, what sort of SSD are you using for your ZIL/SLOG? Just like there are many bad SSDs for a Ceph journal, many of the same performance guidelines apply to the ZIL/SLOG as well.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: April-24-15 1:41 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk n...@fisk.me.uk wrote:
> 7.2k drives tend to do about 80 iops at 4kb IO sizes; as the IO size
> increases the number of iops will start to fall. You will probably get
> around 70 iops for 128kb. But please benchmark your raw disks to get
> some accurate numbers if needed.
>
> Next, when you use on-disk journals you write 1st to the journal and
> then write the actual data. There is also a small levelDB write which
> stores ceph metadata, so depending on IO size you will get slightly
> less than half the native disk performance. You then have 2 copies; as
> Ceph won't ACK until both copies have been written, the average latency
> will tend to stray upwards.

What is the purpose of the journal if Ceph waits for the actual write to complete anyway? I.e. with a hardware RAID card with a BBU, the RAID card tells the host that the data is guaranteed safe as soon as it has been written to the BBU.

Does this also mean that all the writing internal to ceph happens synchronously? I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write -> copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware, shouldn't those operations be able to proceed more or less independently? And shouldn't the client be done as soon as the journal is written? I.e.:

copy1-journal-write -v-> copy1-data-write
copy2-journal-write -|-> copy2-data-write
                     +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not four? Plus all the non-trivial overhead for scheduling, LevelDB, network latency, etc.

For the "getting jackhammered by zillions of clients" case, your estimate probably holds more true, because even if writes aren't in the critical path they still happen, and sooner or later the drive runs out of IOPs and things start getting in each other's way. But for a single-client, single-thread case where the cluster is *not* 100% utilized, shouldn't the effective latency be much less?

The other thing about this that I don't quite understand, and the thing that initially had me questioning whether there was something wrong on the Ceph side, is that your estimate is based primarily on the mechanical capabilities of the drives. Yet, in practice, when the Ceph cluster is tapped out for I/O in this situation, iostat says none of the physical drives are more than 10-20% busy and doing 10-20 IOPs to write a couple of MB/sec. And those are the loaded ones at any given time. Many are 10%.
In fact, *none* of the hardware on the Ceph side is anywhere close to fully utilized. If the performance of this cluster is limited by its hardware, shouldn't there be some evidence of that somewhere?

To illustrate, I marked a physical drive out and waited for things to settle down, then ran fio on the physical drive (128KB randwrite, numjobs=1, iodepth=1). It yields a very different picture of the drive's physical limits.

The drive during maxed-out client writes:

Device: rrqm/s  wrqm/s  r/s   w/s     rkB/s  wkB/s     avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl     0.00    0.20    4.80  13.40   23.60  2505.65   277.94    0.26      14.07  16.08    13.34    6.68   12.16

The same drive under fio:

Device: rrqm/s  wrqm/s  r/s   w/s     rkB/s  wkB/s     avgrq-sz  avgqu-sz  await  r_await  w_await  svctm  %util
sdl     0.00    0.00    0.00  377.50  0.00   48320.00  256.00    0.99      2.62   0.00     2.62     2.62   98.72

You could make the argument that we are seeing half the throughput on the same test because Ceph is write-doubling (journal+data) and that the reason no drive is highly utilized is because the load is being spread out. So each of 28 drives actually is being maxed out, but only 3.5% of the time, leading to low apparent utilization because the measurement interval is too
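For reference, the raw-disk test described above ("128KB randwrite, numjobs=1, iodepth=1") corresponds to an fio invocation along these lines. The device name matches the iostat output; this is destructive, so only run it against a drive that has been marked out:

  fio --name=raw-randwrite --filename=/dev/sdl \
      --ioengine=libaio --direct=1 \
      --rw=randwrite --bs=128k --numjobs=1 --iodepth=1 \
      --runtime=60 --time_based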
Re: [ceph-users] Ceph on Solaris / Illumos
Performance on ZFS on Linux (ZoL) seems to be fine, as long as you use the generic Ceph filesystem implementation (writeahead) and not the ZFS-specific Ceph implementation; the CoW snapshotting that Ceph does with ZFS support compiled in absolutely kills performance. I suspect the same would go for Ceph on Illumos on ZFS. Otherwise it is comparable to XFS in my own testing once tweaked. There are a few oddities/quirks with ZFS performance that need to be tweaked when using it with Ceph, and yes, enabling SA on xattr is one of them.

1. ZFS recordsize - The ZFS sector size, known within ZFS as the recordsize, is technically dynamic. It only enforces the maximum size; however, the way Ceph writes to and reads from objects (when working with smaller blocks, let's say 4k or 8k via rbd) with default settings seems to be affected by the recordsize. With the default 128K I've found lower IOPS and higher latency. Setting the recordsize too low will inflate various ZFS metadata, so it needs to be balanced against how your Ceph pool will be used. For rbd pools (where small-block performance may be important) a recordsize of 32K seems to be a good balance. For pure large-object use (rados, etc.) the 128K default is fine; throughput is high (small-block performance isn't important here). See the following links for more info about recordsize: https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and https://www.joyent.com/blog/bruning-questions-zfs-record-size

2. XATTR - I didn't do much testing here; I've read that if you do not set xattr=sa on ZFS you will get poor performance. There were also stability issues in the past with xattr=sa on ZFS, though it seems all resolved now and I have not encountered any issues myself. I'm unsure what the default setting is here; I always enable it. Make sure you enable and set xattr=sa on ZFS.

3. ZIL (ZFS Intent Log, also known as the slog) is a MUST (even with a separate ceph journal) - It appears that while the ceph journal offloads/absorbs writes nicely and boosts performance, it does not consolidate writes enough for ZFS. Without a ZIL/SLOG your performance will be very sawtooth-like (jumpy, stuttering, fast then slow over a period of 10-15 seconds). In theory tweaking the various ZFS TXG sync settings might work, but it is overly complicated to maintain and would likely only apply to the specific underlying disk model. Disabling sync also resolves this, though you'll lose the last TXG on a power failure - this might be okay with Ceph, but since I'm unsure I'll just assume it is not. IMHO avoid too much evil tuning; just add a ZIL/SLOG.

4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal - Performance is very similar; if you have a ZIL/SLOG you could easily get away without a separate ceph journal and leave it on the device/ZFS dataset. HOWEVER, this causes HUGE amounts of fragmentation due to the CoW nature. After only a few days' usage, performance tanked with the ceph journal on the same device. I did find that if you partition and share a device/SSD between both the ZIL/SLOG and a separate ceph journal, the resulting performance is about the same in pure throughput/iops, though latency is slightly higher. This is what I do in my test cluster.

5. Fragmentation - once you hit around 80-90% disk usage your performance will start to slow down due to fragmentation. This isn't due to Ceph; it's a known ZFS quirk due to its CoW nature.
Unfortunately there is no defrag in ZFS, and likely never will be (the mythical block-pointer rewrite unicorn you'll find people talking about). There is one way to delay it and possibly avoid it, however: enable metaslab_debug. This keeps the ZFS spacemaps in memory, allowing ZFS to make better placements during CoW operations, but it does use more memory. See the following links for more detail about spacemaps and fragmentation: http://blog.delphix.com/uday/2013/02/19/78/ and http://serverfault.com/a/556892 and http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html

There's a lot more to ZFS and things-to-know than that (L2ARC uses ARC metadata space, dedupe uses ARC metadata space, etc.), but as far as Ceph is concerned the above is a good place to start. ZFS IMHO is a great solution, but it requires some time and effort to do it right.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark Nelson
Sent: April-15-15 12:22 PM
To: Jake Young
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on Solaris / Illumos

On 04/15/2015 10:36 AM, Jake Young wrote:
> On Wednesday, April 15, 2015, Mark Nelson mnel...@redhat.com wrote:
>> On 04/15/2015 08:16 AM, Jake Young wrote:
>>> Has anyone compiled ceph (either osd or client) on a Solaris based OS
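To make the points above concrete, here is a minimal sketch as ZFS commands. Pool and device names are illustrative, and the last knob assumes ZoL, where (to my understanding) the old Illumos metaslab_debug tunable is exposed as module parameters instead:

  # points 1+2: 32K recordsize for an rbd-heavy pool, xattr=sa, lz4 on top
  zpool create osd01 -O recordsize=32K -O xattr=sa -O compression=lz4 /dev/sdb

  # point 3: dedicated ZIL/SLOG on an SSD partition to smooth the sawtooth
  zpool add osd01 log /dev/sdc1

  # point 5 mitigation (assumption: ZoL 0.6.3+): keep spacemaps in memory
  echo 1 > /sys/module/zfs/parameters/metaslab_debug_load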
Re: [ceph-users] full ssd setup preliminary hammer bench
Any quick write performance data?

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Alexandre DERUMIER
Sent: April-17-15 11:38 AM
To: Mark Nelson; ceph-users
Subject: [ceph-users] full ssd setup preliminary hammer bench

Hi Mark,

I finally got my hardware for my production full-SSD cluster.

Here a first preliminary bench (1 osd). I got around 45K iops with randread 4K with a small 10GB rbd volume.

I'm pretty happy because I don't see anymore a huge cpu difference between krbd and librbd. In my previous bench I was using Debian wheezy as client; now it's CentOS 7.1, so maybe something is different (glibc, ...). I'm planning to do a big benchmark, CentOS vs Ubuntu vs Debian, client and server, to compare. I have 18 ssd osds for the benchmarks.

results:

rand 4K: 1 osd - fio + librbd: iops: 45.1K

clat percentiles (usec):
 |  1.00th=[  358],  5.00th=[  406], 10.00th=[  446], 20.00th=[  556],
 | 30.00th=[  676], 40.00th=[ 1048], 50.00th=[ 1192], 60.00th=[ 1304],
 | 70.00th=[ 1400], 80.00th=[ 1496], 90.00th=[ 1624], 95.00th=[ 1720],
 | 99.00th=[ 1880], 99.50th=[ 1928], 99.90th=[ 2064], 99.95th=[ 2128],
 | 99.99th=[ 2512]

cpu server: 89.1% idle
cpu client: 92.5% idle

fio + krbd: iops: 47.5K

clat percentiles (usec):
 |  1.00th=[  620],  5.00th=[  636], 10.00th=[  644], 20.00th=[  652],
 | 30.00th=[  668], 40.00th=[  676], 50.00th=[  684], 60.00th=[  692],
 | 70.00th=[  708], 80.00th=[  724], 90.00th=[  756], 95.00th=[  820],
 | 99.00th=[ 1004], 99.50th=[ 1032], 99.90th=[ 1144], 99.95th=[ 1448],
 | 99.99th=[ 2224]

cpu server: 92.4% idle
cpu client: 96.8% idle

hardware (ceph node & client node):
---
ceph: hammer
os: centos 7.1
2 x 10-core Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz
64GB ram
2 x intel s3700 100GB: raid1: os + monitor
6 x intel s3500 160GB: osds
2 x 10gb mellanox connect-x3 (lacp)

network
---
mellanox sx1012 with breakout cables (10GB)

centos tuning:
---
- noop scheduler
- tuned-adm profile latency-performance

ceph.conf
---
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd pool default min size = 1
debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0
osd_op_threads = 5
filestore_op_threads = 4
osd_op_num_threads_per_shard = 1
osd_op_num_shards = 10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_nocrc = true
ms_dispatch_throttle_bytes = 0
cephx sign messages = false
cephx require signatures = false

[client]
rbd_cache = false

rand 4K: rbd volume size: 10GB (data in osd node buffer - no access to disk)
--

fio + librbd:

[global]
ioengine=rbd
clientname=admin
pool=pooltest
rbdname=rbdtest
invalidate=0
rw=randread
direct=1
bs=4k
numjobs=2
group_reporting=1
iodepth=32

fio + krbd:

[global]
ioengine=aio
invalidate=1    # mandatory
rw=randread
bs=4K
direct=1
numjobs=2
group_reporting=1
size=10G
iodepth=32
filename=/dev/rbd0    (noop scheduler)
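For completeness, the two job files above would be run along these lines. The job file names are made up; the pool/image names come from the job files themselves:

  # librbd job: fio talks to the cluster directly through librbd
  fio librbd-randread.fio

  # krbd job: map the image first, set the noop scheduler, then run
  rbd map pooltest/rbdtest               # shows up as /dev/rbd0
  echo noop > /sys/block/rbd0/queue/scheduler
  fio krbd-randread.fio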
Re: [ceph-users] use ZFS for OSDs
pgs inconsistent; 2 scrub errors; noout flag(s) set
 monmap e2: 3 mons at {ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0}, election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
 osdmap e3178: 3 osds: 3 up, 3 in
        flags noout
  pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
        1756 GB used, 704 GB / 2460 GB avail
               2 active+clean+inconsistent
             391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s

3. Repair must be manually kicked off

[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair
[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.5348/head//5 candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 5.25 repair 1 errors, 1 fixed

Conclusion: ZFS compression works GREAT - between 30-50% compression depending on data (I was getting around 30-35% with only OS images; once I loaded on real test data (SVN/GIT/etc.) this increased to 50%). ZFS dedupe doesn't seem to get you much, at least with how Ceph works - maybe due to my recordsize (32K)? ZFS/Ceph bitrot/corruption protection isn't fully automated but still pretty damn good in my opinion - an improvement over the silent bitrot coin-tossing of other filesystems, provided Ceph detects an error. Ceph attempts to access the file, ZFS detects the error and basically kills access to the file; Ceph sees this as a read error and kicks off a scrub on the PG. PG repair does not seem to happen automatically; however, when manually kicked off it succeeds.

Let me know if there's anything else or any questions people have while I have this test cluster running.

Cheers,

Michal Kozanecki | Linux Administrator | mkozane...@evertz.com

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Fri, 31 Oct 2014 16:32:49 +0000 Michal Kozanecki wrote:
> I'll test this by manually inducing corrupted data to the ZFS
> filesystem and report back how ZFS+ceph interact during a detected
> file failure/corruption, how it recovers and any manual steps
> required, and report back with the results.

Looking forward to that.

> As for compression, using lz4 the CPU impact is around 5-20% depending
> on load, type of I/O and I/O size, with little-to-no I/O performance
> impact, and in fact in some cases the I/O performance actually
> increases. I'm currently looking at a compression ratio on the ZFS
> datasets of around 30-35% for data consisting of rbd-backed OpenStack
> KVM VMs.
I'm looking at a similar deployment (VM images), and over 30% compression would at least negate ZFS's need to keep at least 20% free space (it suffers massive degradation otherwise). CPU usage looks acceptable; however, in combination with SSD-backed OSDs that's another thing to consider. As in, is it worth spending X amount of money on faster CPUs for 10-20% space savings, or will another SSD be cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10 times data reduction by a combination of compression, deduping and thin provisioning. Without, of course, quantifying things like which step gives which reduction based on what sample data.

> I have not tried any sort of dedupe as it is memory intensive and I
> only had 24GB of ram on each node. I'll grab some FIO benchmarks and
> report back.

I foresee a massive failure here, despite a huge potential given one use case here where all VMs are basically identical (KSM is very effective with those, too). Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. That will make a big dent, but with many nearly identical VM images we should still have quite a bit of identical data per OSD. However...

2. Data alignment. The default RADOS objects making up images are 4MB. Which, given my limited
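For reference, the manual repair flow from Michal's test in this thread condenses to the following. PG ids and log path are taken from his example output:

  ceph pg deep-scrub 5.18      # surface the read error if a scrub hasn't already
  ceph health detail           # look for active+clean+inconsistent
  ceph pg repair 5.18          # repair is NOT kicked off automatically
  tail -f /var/log/ceph/ceph-osd.2.log   # "repair 1 errors, 1 fixed" on the primary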
Re: [ceph-users] Power failure recovery woes
Hi Jeff,

What type/model drives are you using as OSDs? Any journals? If so, what model? What does your ceph.conf look like? What sort of load is on the cluster (if it's still online)? What distro/version? Firewall rules set properly?

Michal Kozanecki

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jeff
Sent: February-17-15 9:17 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Power failure recovery woes

Some additional information/questions:

Here is the output of ceph osd tree (below). Some of the down OSDs are actually running, but are marked down. For example osd.1:

root 30158 8.6 12.7 1542860 781288 ? Ssl 07:47 4:40 /usr/bin/ceph-osd --cluster=ceph -i 0 -f

Is there any way to get the cluster to recognize them as being up? osd.1 has the FAILED assert(last_e.version.version < e.version.version) errors.

Thanks,
    Jeff

# id    weight  type name      up/down reweight
-1      10.22   root default
-2      2.72            host ceph1
0       0.91                    osd.0   up      1
1       0.91                    osd.1   down    0
2       0.9                     osd.2   down    0
-3      1.82            host ceph2
3       0.91                    osd.3   down    0
4       0.91                    osd.4   down    0
-4      2.04            host ceph3
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   down    0
-5      1.82            host ceph4
9       0.91                    osd.9   up      1
10      0.91                    osd.10  down    0
-6      1.82            host ceph5
11      0.91                    osd.11  up      1
12      0.91                    osd.12  up      1

On 2/17/2015 8:28 AM, Jeff wrote:

-------- Original Message --------
Subject: Re: [ceph-users] Power failure recovery woes
Date: 2015-02-17 04:23
From: Udo Lembke ulem...@polarzone.de
To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com

Hi Jeff,
is the osd /var/lib/ceph/osd/ceph-2 mounted? If not, does it help if you mount the osd and start it with "service ceph start osd.2"?

Udo

Am 17.02.2015 09:54, schrieb Jeff:
> Hi,
>
> We had a nasty power failure yesterday and even with UPS's our small
> (5 node, 12 OSD) cluster is having problems recovering.
>
> We are running ceph 0.87
>
> 3 of our OSDs are down consistently (others stop and are restartable,
> but our cluster is so slow that almost everything we do times out).
>
> We are seeing errors like this on the OSDs that never run:
>
> ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) Operation not permitted
>
> We are seeing errors like these on the OSDs that run some of the time:
>
> osd/PGLog.cc: 844: FAILED assert(last_e.version.version < e.version.version)
> common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide timeout")
>
> Does anyone have any suggestions on how to recover our cluster?
>
> Thanks!
>     Jeff
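A few first steps for the "running but marked down" symptom above might look like this (a sketch; it assumes the OSD admin sockets are at their default paths):

  # confirm which id the running daemon actually carries -- note the ps
  # line above shows '-i 0' even though osd.1 is the one reported down
  ps aux | grep ceph-osd

  # ask the daemon itself over its admin socket
  ceph daemon osd.1 status

  # check for the PGLog assert mentioned above
  grep 'FAILED assert' /var/log/ceph/ceph-osd.1.log | tail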
Re: [ceph-users] use ZFS for OSDs
I'll test this by manually inducing corrupted data on the ZFS filesystem and report back how ZFS+Ceph interact during a detected file failure/corruption, how it recovers, and any manual steps required.

As for compression, using lz4 the CPU impact is around 5-20% depending on load, type of I/O and I/O size, with little-to-no I/O performance impact; in fact, in some cases the I/O performance actually increases. I'm currently looking at a compression ratio on the ZFS datasets of around 30-35% for data consisting of rbd-backed OpenStack KVM VMs.

I have not tried any sort of dedupe as it is memory intensive and I only had 24GB of ram on each node. I'll grab some FIO benchmarks and report back.

Cheers,

-----Original Message-----
From: Christian Balzer [mailto:ch...@gol.com]
Sent: October-30-14 4:12 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Wed, 29 Oct 2014 15:32:57 +0000 Michal Kozanecki wrote:
[snip]
> With Ceph handling the redundancy at the OSD level I saw no need for
> ZFS mirroring or zraid; instead, if ZFS detects corruption, instead of
> self-healing it sends a read failure of the pg file to ceph, and
> ceph's scrub mechanisms should then repair/replace the pg file using a
> good replica elsewhere on the cluster. ZFS + ceph are a beautiful
> bitrot-fighting match!

Could you elaborate on that? AFAIK Ceph currently has no way to determine which of the replicas is good; one such failed PG object will require you to do a manual repair after the scrub and hope that the two surviving replicas (assuming a size of 3) are identical. If not, start tossing a coin. Ideally Ceph would have a way to know what happened (as in, it's a checksum error and not a real I/O error) and do a rebuild of that object itself.

On another note, have you done any tests using the ZFS compression? I'm wondering what the performance impact and efficiency are.

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com           Global OnLine Japan/Fusion Communications
http://www.gol.com/
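The compression numbers quoted above can be read straight off the dataset; the dataset name here is assumed:

  zfs set compression=lz4 osd01
  # a compressratio of ~1.4x-1.5x corresponds to the 30-50% savings above
  zfs get compression,compressratio osd01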
Re: [ceph-users] use ZFS for OSDs
Forgot to mention: when you create the ZFS/zpool datasets, make sure to set the xattr setting to sa, e.g.

zpool create osd01 -O xattr=sa -O compression=lz4 sdb

OR, if the zpool/zfs dataset is already created:

zfs set xattr=sa osd01

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) on top of CentOS 7, so I'll try and answer any questions. :)

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll need to compile it with --with-libzfs, but that by itself will fail to compile the ZFS support, as I found out. You need to ensure you have ZoL installed and working, and then pass the location of libzfs to ceph at compile time. Personally I just set my environment variables before compiling, like so:

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS="-I/usr/include/libzfs -I/usr/include/libspl"

However, the writeparallel performance isn't all that great. The writeparallel mode makes heavy use of ZFS's (and BtrFS's, for that matter) snapshotting capability, and the snap performance on ZoL, at least when I last tested it, is pretty terrible. You lose any performance benefits you gain with writeparallel to the poor snap performance.

If you decide that you don't need writeparallel mode, you can use the prebuilt packages (or compile with default options) without issue. Ceph (without ZFS support compiled in) will detect ZFS as a generic/ext4 filesystem and work accordingly.

As far as performance tweaking, ZIL, write journals and etc., I found that the performance difference between using a ZIL vs a ceph write journal is about the same. I also found that doing both (ZIL AND write journal) didn't give me much of a performance benefit. In my small test cluster I decided after testing to forego the ZIL and only use an SSD-backed ceph write journal on each OSD, with each OSD being a single ZFS dataset/vdev (no zraid or mirroring). With Ceph handling the redundancy at the OSD level I saw no need for ZFS mirroring or zraid; instead, if ZFS detects corruption, instead of self-healing it sends a read failure of the pg file to ceph, and ceph's scrub mechanisms should then repair/replace the pg file using a good replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot-fighting match!

Let me know if there's anything else I can answer.

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

Hi,

We are looking to use ZFS for our OSD backend, but I have some questions.

My main question is: Does Ceph already support the writeparallel mode for ZFS? (as described here: http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)

I've found this, but I suppose it is outdated: https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

Should Ceph be built with ZFS support? I found a --with-zfslib option somewhere, but can someone verify this, or better, has instructions for it? :-)

What parameters should be tuned to use this? I found these:

filestore zfs_snap = 1
journal_aio = 0
journal_dio = 0

Are there other things we need for it?

Many thanks!!
Kenneth
Re: [ceph-users] use ZFS for OSDs
Hi Stijn,

Yes, on my cluster I am running: CentOS 7, ZoL 0.6.3, Ceph 0.80.5.

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn De Weirdt
Sent: October-29-14 3:49 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] use ZFS for OSDs

hi michal,

thanks for the info. we will certainly try it and see if we come to the same conclusions ;) one small detail: since you were using centos7, i'm assuming you were using ZoL 0.6.3?

stijn

On 10/29/2014 08:03 PM, Michal Kozanecki wrote:

Forgot to mention: when you create the ZFS/zpool datasets, make sure to set the xattr setting to sa, e.g.

zpool create osd01 -O xattr=sa -O compression=lz4 sdb

OR, if the zpool/zfs dataset is already created:

zfs set xattr=sa osd01

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) on top of CentOS 7, so I'll try and answer any questions. :)

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll need to compile it with --with-libzfs, but that by itself will fail to compile the ZFS support, as I found out. You need to ensure you have ZoL installed and working, and then pass the location of libzfs to ceph at compile time. Personally I just set my environment variables before compiling, like so:

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS="-I/usr/include/libzfs -I/usr/include/libspl"

However, the writeparallel performance isn't all that great. The writeparallel mode makes heavy use of ZFS's (and BtrFS's, for that matter) snapshotting capability, and the snap performance on ZoL, at least when I last tested it, is pretty terrible. You lose any performance benefits you gain with writeparallel to the poor snap performance.

If you decide that you don't need writeparallel mode, you can use the prebuilt packages (or compile with default options) without issue. Ceph (without ZFS support compiled in) will detect ZFS as a generic/ext4 filesystem and work accordingly.

As far as performance tweaking, ZIL, write journals and etc., I found that the performance difference between using a ZIL vs a ceph write journal is about the same. I also found that doing both (ZIL AND write journal) didn't give me much of a performance benefit. In my small test cluster I decided after testing to forego the ZIL and only use an SSD-backed ceph write journal on each OSD, with each OSD being a single ZFS dataset/vdev (no zraid or mirroring). With Ceph handling the redundancy at the OSD level I saw no need for ZFS mirroring or zraid; instead, if ZFS detects corruption, instead of self-healing it sends a read failure of the pg file to ceph, and ceph's scrub mechanisms should then repair/replace the pg file using a good replica elsewhere on the cluster. ZFS + ceph are a beautiful bitrot-fighting match!

Let me know if there's anything else I can answer.

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

Hi,

We are looking to use ZFS for our OSD backend, but I have some questions.

My main question is: Does Ceph already support the writeparallel mode for ZFS?
(as described here: http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)

I've found this, but I suppose it is outdated: https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

Should Ceph be built with ZFS support? I found a --with-zfslib option somewhere, but can someone verify this, or better, has instructions for it? :-)

What parameters should be tuned to use this? I found these:

filestore zfs_snap = 1
journal_aio = 0
journal_dio = 0

Are there other things we need for it?

Many thanks!!

Kenneth
Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's
Network issue maybe? Have you checked your firewall settings? iptables changed a bit in EL7 and might have broken any rules you normally try and use. Try flushing the rules (iptables -F) and see if that fixes things; if it does, you'll need to fix your firewall rules. I ran into a similar issue on EL7 where the OSDs appeared up and in, but were stuck in peering, which was due to a few ports being blocked.

Cheers

-----Original Message-----
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of BG
Sent: September-09-14 6:05 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

Loic Dachary loic@... writes:
> Hi,
> It looks like your osd.0 is down and you only have one osd left
> (osd.1), which would explain why the cluster cannot get to a healthy
> state. The "size 2" in "pool 0 'data' replicated size 2 ..." means the
> pool needs at least two OSDs up to function properly. Do you know why
> the osd.0 is not up?
> Cheers

I've been trying unsuccessfully to get this up and running since. I've added another OSD but still can't get to an active+clean state. I'm not even sure if the problems I'm having are related to the OS version, but I'm running out of ideas, and unless somebody here can spot something obvious in the logs below I'm going to try rolling back to CentOS 6.

$ echo HEALTH; ceph health; echo STATUS; ceph status; echo OSD_DUMP; ceph osd dump

HEALTH
HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
STATUS
    cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
     health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
     monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum 0 hp09
     osdmap e43: 3 osds: 3 up, 3 in
      pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
            15469 MB used, 368 GB / 383 GB avail
                 129 peering
                  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up in weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval [0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988 10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up in weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval [7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999 10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up 8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up in weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval [11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605 10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up 5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:

2014-09-09 10:51:10.636325 7f75037d0700 1 mon.hp09@0(leader).osd e72 prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2 10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700 0 log [DBG] : osd.1 10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605

Sample of warnings from osd.2 log:

2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply from
osd.1 ever on either front or back, first ping sent 2014-09-09 10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700 0 log [WRN] : map e19 wrongly marked me down
2014-09-09 10:44:13.726104 7fb81f2f9700 0 osd.2 19 crush map has features 1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700 0 -- 10.119.16.16:0/25605 >> 10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1 c=0x3ad8580).fault
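On EL7 a cleaner alternative to flushing the rules is opening Ceph's ports in firewalld. A sketch, assuming the default ports (6789 for monitors, 6800-7300 for OSDs) and the public zone:

  firewall-cmd --permanent --zone=public --add-port=6789/tcp
  firewall-cmd --permanent --zone=public --add-port=6800-7300/tcp
  firewall-cmd --reload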
Re: [ceph-users] NAS on RBD
Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite blair.bethwa...@gmail.com wrote:
> Hi Dan,
>
> Thanks for sharing!
>
> On 9 September 2014 20:12, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
>> We do this for some small scale NAS use-cases, with ZFS running in a
>> VM with rbd volumes. The performance is not great (especially since
>> we throttle the IOPS of our RBD). We also tried a few kRBD / ZFS
>> servers with an SSD ZIL — the SSD solves any performance problem we
>> ever had with ZFS on RBD.
>
> That's good to hear. My limited experience doing this on a smaller
> Ceph cluster (and without any SSD journals or cache devices for the
> ZFS head) points to write latency being an immediate issue; decent
> PCIe SLC SSD devices should pretty much sort that out given the
> cluster itself has plenty of write throughput available. Then there's
> further MLC devices for L2ARC - not sure yet, but guessing
> metadata-heavy datasets might require primarycache=metadata and rely
> on L2ARC for data cache. And all this should get better in the medium
> term with performance improvements and RDMA capability (we're building
> this with that option in the hole).

I'd love to go back and forth with you privately or on one of the ZFS mailing lists if you want to discuss ZFS tuning in depth, but I want to just mention that setting primarycache=metadata will also cause the L2ARC to ONLY store and accelerate metadata as well (despite whatever secondarycache is set to). I believe this is something that the ZFS developers are looking to improve eventually, but as-is, currently that's how it works (the L2ARC only contains what was pushed out of the main in-memory ARC).

>> I would say though that this setup is rather adventurous. ZoL is not
>> rock solid — we've had a few lockups in testing, all of which have
>> been fixed in the latest ZFS code in git (my colleague in CC could
>> elaborate if you're interested).
>
> Hmm okay, that's not great. The only problem I've experienced thus far
> is when the ZoL repos stopped providing DKMS and borked an upgrade for
> me until I figured out what had happened and cleaned up the old .ko
> files. So yes, interested to hear elaboration on that.

You mentioned in one of your other emails that if you deployed this idea of a ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than krbd. If you're worried about ZoL stability and feel comfortable going outside Linux, you could always go with a *BSD or Illumos distro where ZFS support is much more stable/solid. In any case I haven't had any major show-stopping issues with ZoL myself and I use it heavily. Still, unless you're really comfortable with ZoL or *BSD/Illumos (as I am), I'd likely recommend looking into other solutions.

>> One thing I'm not comfortable with is the idea of ZFS checking the
>> data in addition to Ceph. Sure, ZFS will tell us if there is a
>> checksum error, but without any redundancy at the ZFS layer there
>> will be no way to correct that error. Of course, the hope is that
>> RADOS will ensure 100% data consistency, but what happens if not?...
>
> The ZFS checksumming would tell us if there has been any corruption,
> which as you've pointed out shouldn't happen anyway on top of Ceph.
Just want to quickly address this - someone correct me if I'm wrong, but IIRC even with a replica value of 3 or more, ceph does not (currently) have any intelligence when it detects a corrupted/incorrect PG; it will always replace/repair the PG with whatever data is in the primary, meaning that if the primary PG is the one that's corrupted/bit-rotted/incorrect, it will replace the good replicas with the bad.

> But if we did have some awful disaster scenario where that happened
> then we'd be restoring from tape, and it'd sure be good to know which
> files actually needed restoring. I.e., if we lost a single PG at the
> Ceph level then we don't want to have to blindly restore the whole
> zpool or dataset.
>
>> Personally, I think you're very brave to consider running 2PB of ZoL
>> on RBD. If I were you I would seriously evaluate the CephFS option.
>> It used to be on the roadmap for ICE 2.0 coming out this fall, though
>> I noticed it's not there anymore (??!!!).
>
> Yeah, it's very disappointing that this was silently removed. And it's
> particularly concerning that this happened post RedHat acquisition.
> I'm an ICE customer and sure would have liked some input there for
> exactly the reason we're discussing.

I'm looking forward to CephFS as well, and I agree, it's somewhat concerning that it happened post RedHat acquisition. I'm hoping RedHat pours more resources into InkTank and ceph, and doesn't instead leech resources away from them.

>> Anyway I would say that ZoL on kRBD is not necessarily a more stable
>> solution than CephFS. Even Gluster striped on top of RBD would
>> probably be more stable than ZoL on RBD.
>
> If we really have to we'll just run Gluster natively instead (or
> perhaps XFS on RBD as the option before that) - the hardware
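The primarycache/secondarycache interaction Michal describes can be seen with a few property settings (dataset name is hypothetical):

  # intent: metadata in ARC, data in L2ARC -- but as described above, on
  # current ZoL the L2ARC is fed only from blocks that passed through the
  # ARC, so primarycache=metadata ends up limiting L2ARC to metadata too
  zfs set primarycache=metadata tank/nas
  zfs set secondarycache=all tank/nas
  zfs get primarycache,secondarycache tank/nas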