Re: [ceph-users] Ceph OSDs with bcache experience

2015-11-05 Thread Michal Kozanecki
Why did you guys go with partitioning the SSD for ceph journals, instead of 
just using the whole SSD for bcache and leaving the journal on the filesystem 
(which itself is on top of bcache)? Was there really a benefit to separating 
the journals from the bcache-fronted HDDs?

I ask because it has been shown in the past that separating out the journal 
doesn't really do much for SSD-based pools.
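
For reference, the whole-SSD alternative I have in mind would look roughly like 
this (device names are illustrative, and the attach step takes the cache-set 
UUID that make-bcache prints):

make-bcache -C /dev/sdb                  # whole SSD becomes the cache device
make-bcache -B /dev/sdc                  # each HDD becomes a backing device -> /dev/bcache0
echo <cset-uuid> > /sys/block/bcache0/bcache/attach   # attach the backing device to the cache set
mkfs.xfs /dev/bcache0                    # OSD filesystem; the journal would just be a file on it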

Michal Kozanecki | Linux Administrator | mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: October-28-15 5:49 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph OSDs with bcache experience



On 21-10-15 15:30, Mark Nelson wrote:
> 
> 
> On 10/21/2015 01:59 AM, Wido den Hollander wrote:
>> On 10/20/2015 07:44 PM, Mark Nelson wrote:
>>> On 10/20/2015 09:00 AM, Wido den Hollander wrote:
>>>> Hi,
>>>>
>>>> In the "newstore direction" thread on ceph-devel I wrote that I'm 
>>>> using bcache in production and Mark Nelson asked me to share some details.
>>>>
>>>> Bcache is running in two clusters now that I manage, but I'll keep 
>>>> this information to one of them (the one at PCextreme behind CloudStack).
>>>>
>>>> This cluster has been running for over 2 years now:
>>>>
>>>> epoch 284353
>>>> fsid 0d56dd8f-7ae0-4447-b51b-f8b818749307
>>>> created 2013-09-23 11:06:11.819520
>>>> modified 2015-10-20 15:27:48.734213
>>>>
>>>> The system consists of 39 hosts:
>>>>
>>>> 2U SuperMicro chassis:
>>>> * 80GB Intel SSD for OS
>>>> * 240GB Intel S3700 SSD for Journaling + Bcache
>>>> * 6x 3TB disk
>>>>
>>>> This isn't the newest hardware. The next batch of hardware will be 
>>>> more disks per chassis, but this is it for now.
>>>>
>>>> All systems were installed with Ubuntu 12.04, but they are all 
>>>> running
>>>> 14.04 now with bcache.
>>>>
>>>> The Intel S3700 SSD is partitioned with a GPT label:
>>>> - 5GB Journal for each OSD
>>>> - 200GB Partition for bcache
>>>>
>>>> root@ceph11:~# df -h|grep osd
>>>> /dev/bcache0    2.8T  1.1T  1.8T  38% /var/lib/ceph/osd/ceph-60
>>>> /dev/bcache1    2.8T  1.2T  1.7T  41% /var/lib/ceph/osd/ceph-61
>>>> /dev/bcache2    2.8T  930G  1.9T  34% /var/lib/ceph/osd/ceph-62
>>>> /dev/bcache3    2.8T  970G  1.8T  35% /var/lib/ceph/osd/ceph-63
>>>> /dev/bcache4    2.8T  814G  2.0T  30% /var/lib/ceph/osd/ceph-64
>>>> /dev/bcache5    2.8T  915G  1.9T  33% /var/lib/ceph/osd/ceph-65
>>>> root@ceph11:~#
>>>>
>>>> root@ceph11:~# lsb_release -a
>>>> No LSB modules are available.
>>>> Distributor ID:  Ubuntu
>>>> Description:     Ubuntu 14.04.3 LTS
>>>> Release:         14.04
>>>> Codename:        trusty
>>>> root@ceph11:~# uname -r
>>>> 3.19.0-30-generic
>>>> root@ceph11:~#
>>>>
>>>> "apply_latency": {
>>>>   "avgcount": 2985023,
>>>>   "sum": 226219.891559000
>>>> }
>>>>
>>>> What did we notice?
>>>> - Less spikes on the disk
>>>> - Lower commit latencies on the OSDs
>>>> - Almost no 'slow requests' during backfills
>>>> - Cache-hit ratio of about 60%
>>>>
>>>> Max backfills and recovery active are both set to 1 on all OSDs.
>>>>
>>>> For the next generation hardware we are looking into using 3U 
>>>> chassis with 16 4TB SATA drives and a 1.2TB NVM-E SSD for bcache, 
>>>> but we haven't tested those yet, so nothing to say about it.
>>>>
>>>> The current setup is 200GB of cache for 18TB of disks. The new 
>>>> setup will be 1200GB for 64TB, curious to see what that does.
>>>>
>>>> Our main conclusion however is that it does smoothen the 
>>>> I/O pattern towards the disks and that gives an overall better 
>>>> response from the disks.
>>>
>>> Hi Wido, thanks for the big writeup!  Did you guys happen to do any 
>>> benchmarking?  I think Xiaoxi looked at flashcache a while back but 
>>> had mixed results if I remember right.  It would be interesting to 
>>> know how bcache is affecting performance in different scenarios.
>>>
>>
>> No, we didn't do any benchmarking. Initially this clu

Re: [ceph-users] Having trouble getting good performance

2015-04-26 Thread Michal Kozanecki
Quick correction/clarification about ZFS and large blocks - ZFS can and will 
write in 1MB or larger blocks, but only in the latest versions with large block 
support enabled (which I am not sure ZoL has); by default block aggregation is 
limited to 128KB. The rest of my post (about multiple vdevs, slog, etc.) 
stands.

https://reviews.csiden.org/r/51/
https://www.illumos.org/issues/5027

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: April-24-15 5:03 PM
To: J David; Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

The ZFS recordsize does NOT equal the size of the write to disk; ZFS will write 
to disk whatever size it feels is optimal. During a sequential write ZFS will 
easily write in 1MB blocks or greater. 

In a spinning-rust CEPH setup like yours, getting the most out of it will 
require higher IO depths. In this case increasing the number of vdevs ZFS sees 
might help. Instead of a single vdev on top of a single monolithic 32TB rbd 
volume, how about a striped ZFS setup with 8 vdevs on top of 8 smaller 4TB rbd 
volumes?

Also, what sort of SSD are you using for your ZIL/SLOG? Just like there are 
many bad SSDs for a CEPH journal, many of the same performance guidelines apply 
to the ZIL/SLOG as well.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: April-24-15 1:41 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk n...@fisk.me.uk wrote:
 7.2k drives tend to do about 80 iops at 4kb IO sizes, as the IO size 
 increases the number of iops will start to fall. You will probably get 
 around 70 iops for 128kb. But please benchmark your raw disks to get 
 some accurate numbers if needed.

 Next when you use on-disk journals you write 1st to the journal and 
 then write the actual data. There is also a small levelDB write which 
 stores ceph metadata so depending on IO size you will get slightly 
 less than half the native disk performance.

 You then have 2 copies, as Ceph won't ACK until both copies have been 
 written the average latency will tend to stray upwards.

What is the purpose of the journal if Ceph waits for the actual write to 
complete anyway?

I.e. with a hardware raid card with a BBU, the raid card tells the host that 
the data is guaranteed safe as soon as it has been written to the BBU.

Does this also mean that all the writing internal to ceph happens synchronously?

I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write -> 
copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware, shouldn't 
those operations be able to proceed more or less independently?  And shouldn't 
the client be done as soon as the journal is written?  I.e.:

copy1-journal-write -v-> copy1-data-write
copy2-journal-write -|-> copy2-data-write
                     +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not four?  
Plus all the non-trivial overhead for scheduling, LevelDB, network latency, etc.

For the "getting jackhammered by zillions of clients" case, your estimate 
probably holds more true, because even if writes aren't in the critical path 
they still happen and sooner or later the drive runs out of IOPs and things 
start getting in each others' way.  But for a single client, single thread case 
where the cluster is *not* 100% utilized, shouldn't the effective latency be 
much less?

The other thing about this that I don't quite understand, and the thing that 
initially had me questioning whether there was something wrong on the Ceph side, 
is that your estimate is based primarily on the mechanical capabilities of the 
drives.  Yet, in practice, when the Ceph cluster is tapped out for I/O in this 
situation, iostat says none of the physical drives are more than 10-20% busy 
and doing 10-20 IOPs to write a couple of MB/sec.  And those are the loaded 
ones at any given time.  Many are 10%.  In fact, *none* of the hardware on the 
Ceph side is anywhere close to fully utilized.  If the performance of this 
cluster is limited by its hardware, shouldn't there be some evidence of that 
somewhere?

To illustrate, I marked a physical drive out and waited for things to settle 
down, then ran fio on the physical drive (128KB randwrite
numjobs=1 iodepth=1).  It yields a very different picture of the drive's 
physical limits.

The drive during maxxed out client writes:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl   0.00

Re: [ceph-users] Having trouble getting good performance

2015-04-24 Thread Michal Kozanecki
The ZFS recordsize does NOT equal the size of the write to disk; ZFS will write 
to disk whatever size it feels is optimal. During a sequential write ZFS will 
easily write in 1MB blocks or greater. 

In a spinning-rust CEPH setup like yours, getting the most out of it will 
require higher IO depths. In this case increasing the number of vdevs ZFS sees 
might help. Instead of a single vdev on top of a single monolithic 32TB rbd 
volume, how about a striped ZFS setup with 8 vdevs on top of 8 smaller 4TB rbd 
volumes?
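
For what it's worth, a rough sketch of what I mean (image names, sizes and the 
/dev/rbdN ordering are assumptions; on hammer-era clients rbd create --size 
takes megabytes):

for i in 1 2 3 4 5 6 7 8; do
    rbd create --size $((4 * 1024 * 1024)) rbd/zfsvol$i   # 8 x 4TB images
    rbd map rbd/zfsvol$i
done
# one striped pool across all eight mapped devices -- no ZFS-level redundancy,
# since Ceph already replicates underneath:
zpool create tank /dev/rbd0 /dev/rbd1 /dev/rbd2 /dev/rbd3 \
                  /dev/rbd4 /dev/rbd5 /dev/rbd6 /dev/rbd7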

Also, what sort of SSD are you using for your ZIL/SLOG? Just like there are 
many bad SSDs for a CEPH journal, many of the same performance guidelines apply 
to the ZIL/SLOG as well.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of J David
Sent: April-24-15 1:41 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Having trouble getting good performance

On Fri, Apr 24, 2015 at 10:58 AM, Nick Fisk n...@fisk.me.uk wrote:
 7.2k drives tend to do about 80 iops at 4kb IO sizes, as the IO size 
 increases the number of iops will start to fall. You will probably get 
 around 70 iops for 128kb. But please benchmark your raw disks to get 
 some accurate numbers if needed.

 Next when you use on-disk journals you write 1st to the journal and 
 then write the actual data. There is also a small levelDB write which 
 stores ceph metadata so depending on IO size you will get slightly 
 less than half the native disk performance.

 You then have 2 copies, as Ceph won't ACK until both copies have been 
 written the average latency will tend to stray upwards.

What is the purpose of the journal if Ceph waits for the actual write to 
complete anyway?

I.e. with a hardware raid card with a BBU, the raid card tells the host that 
the data is guaranteed safe as soon as it has been written to the BBU.

Does this also mean that all the writing internal to ceph happens synchronously?

I.e. all these operations are serialized:

copy1-journal-write -> copy1-data-write -> copy2-journal-write -> 
copy2-data-write -> OK, client, you're done.

Since copy1 and copy2 are on completely different physical hardware, shouldn't 
those operations be able to proceed more or less independently?  And shouldn't 
the client be done as soon as the journal is written?  I.e.:

copy1-journal-write -v-> copy1-data-write
copy2-journal-write -|-> copy2-data-write
                     +-> OK, client, you're done

If so, shouldn't the effective latency be that of one operation, not four?  
Plus all the non-trivial overhead for scheduling, LevelDB, network latency, etc.

For the "getting jackhammered by zillions of clients" case, your estimate 
probably holds more true, because even if writes aren't in the critical path 
they still happen and sooner or later the drive runs out of IOPs and things 
start getting in each others' way.  But for a single client, single thread case 
where the cluster is *not* 100% utilized, shouldn't the effective latency be 
much less?

The other thing about this that I don't quite understand, and the thing that 
initially had me questioning whether there was something wrong on the Ceph side, 
is that your estimate is based primarily on the mechanical capabilities of the 
drives.  Yet, in practice, when the Ceph cluster is tapped out for I/O in this 
situation, iostat says none of the physical drives are more than 10-20% busy 
and doing 10-20 IOPs to write a couple of MB/sec.  And those are the loaded 
ones at any given time.  Many are 10%.  In fact, *none* of the hardware on the 
Ceph side is anywhere close to fully utilized.  If the performance of this 
cluster is limited by its hardware, shouldn't there be some evidence of that 
somewhere?

To illustrate, I marked a physical drive out and waited for things to settle 
down, then ran fio on the physical drive (128KB randwrite
numjobs=1 iodepth=1).  It yields a very different picture of the drive's 
physical limits.

The drive during maxxed out client writes:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00     0.20    4.80   13.40    23.60  2505.65
  277.94     0.26   14.07   16.08   13.34   6.68  12.16

The same drive under fio:

Device:         rrqm/s   wrqm/s     r/s     w/s    rkB/s    wkB/s
avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdl               0.00     0.00    0.00  377.50     0.00 48320.00
  256.00     0.99    2.62    0.00    2.62   2.62  98.72

You could make the argument that we are seeing half the throughput on the same 
test because ceph is write-doubling (journal+data) and the reason no drive is 
highly utilized is that the load is being spread out.  So each of 28 drives 
actually is being maxed out, but only 3.5% of the time, leading to low apparent 
utilization because the measurement interval is too

Re: [ceph-users] Ceph on Solaris / Illumos

2015-04-17 Thread Michal Kozanecki
Performance of ZFS on Linux (ZoL) seems to be fine, as long as you use the 
generic CEPH filesystem implementation (writeahead) and not the ZFS-specific 
CEPH implementation; the CoW snapshotting that CEPH does with ZFS support 
compiled in absolutely kills performance. I suspect the same would go for CEPH 
on Illumos on ZFS. Otherwise it is comparable to XFS in my own testing once 
tweaked. 

There are a few oddities/quirks with ZFS performance that need to be tweaked 
when using it with CEPH, and yea enabling SA on xattr is one of them.

1. ZFS recordsize - The ZFS sector size, known within ZFS as the recordsize, is 
technically dynamic. It only enforces the maximum size; however, the way CEPH 
writes and reads from objects (when working with smaller blocks, let's say 4k 
or 8k via rbd) with default settings seems to be affected by the recordsize. 
With the default 128K I've found lower IOPS and higher latency. Setting the 
recordsize too low will inflate various ZFS metadata, so it needs to be 
balanced against how your CEPH pool will be used. 

For rbd pools(where small block performance may be important) a recordsize of 
32K seems to be a good balance. For pure large object based use (rados, etc) 
the 128K default is fine, throughput is high(small block performance isn't 
important here). See following links for more info about recordsize: 
https://blogs.oracle.com/roch/entry/tuning_zfs_recordsize and 
https://www.joyent.com/blog/bruning-questions-zfs-record-size
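
For example (dataset name is illustrative; note recordsize only affects files 
written after the property is set, so do this before filling the OSD):

zfs set recordsize=32K osd01
zfs get recordsize,compression osd01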

2. XATTR - I didn't do much testing here, I've read that if you do not set 
xattr = sa on ZFS you will get poor performance. There were also stability 
issues in the past with xattr = sa on ZFS though it seems all resolved now and 
I have not encountered any issues myself. I'm unsure what the default setting 
is here, I always enable it.

Make sure you enable and set xattr = sa on ZFS.

3. ZIL (ZFS Intent Log, also known as the SLOG) is a MUST (even with a separate 
ceph journal) - It appears that while the ceph journal offloads/absorbs writes 
nicely and boosts performance, it does not consolidate writes enough for ZFS. 
Without a ZIL/SLOG your performance will be very sawtooth-like (jumpy, 
stuttering, aka fast then slow, fast then slow over a period of 10-15 seconds). 

In theory tweaking the various ZFS TXG sync settings might work, but it is 
overly complicated to maintain and likely would only apply to the specific 
underlying disk model. Disabling sync also resolves this, though you'll lose 
the last TXG on a power failure - this might be okay with CEPH, but since I'm 
unsure I'll just assume it is not. IMHO avoid too much evil tuning, just add a 
ZIL/SLOG.   
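
Attaching one is a single command, e.g. (pool and device names illustrative):

zpool add osd01 log /dev/sdc1
zpool status osd01     # the device should now show up under a separate 'logs' section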

4. ZIL/SLOG + on-device ceph journal vs ZIL/SLOG + separate ceph journal - 
Performance is very similar; if you have a ZIL/SLOG you could easily get away 
without a separate ceph journal and leave it on the device/ZFS dataset. HOWEVER 
this causes HUGE amounts of fragmentation due to the CoW nature. After only a 
few days' usage, performance tanked with the ceph journal on the same device. 

I did find that if you partition and share device/SSD between both ZIL/SLOG and 
a separate ceph journal, the resulting performance is about the same in pure 
throughput/iops, though latency is slightly higher. This is what I do in my 
test cluster.
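
Roughly what that looks like (device names, sizes and the OSD id here are 
illustrative, not my actual layout):

sgdisk -n 1:0:+10G -c 1:"osd01-slog"   /dev/sdc   # partition 1: ZFS SLOG
sgdisk -n 2:0:+10G -c 2:"osd0-journal" /dev/sdc   # partition 2: separate ceph journal
zpool add osd01 log /dev/sdc1
service ceph stop osd.0
ceph-osd -i 0 --flush-journal
ln -sf /dev/sdc2 /var/lib/ceph/osd/ceph-0/journal
ceph-osd -i 0 --mkjournal
service ceph start osd.0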

5. Fragmentation - once you hit around 80-90% disk usage your performance will 
start to slow down due to fragmentation. This isn't due to CEPH; it's a known 
ZFS quirk due to its CoW nature. Unfortunately there is no defrag in ZFS, and 
likely never will be (the mythical block pointer rewrite unicorn you'll find 
people talking about). 

There is one way to delay it and possibly avoid it, however: enable 
metaslab_debug. This will put the ZFS spacemaps in memory, allowing ZFS to make 
better placements during CoW operations, but it does use more memory. See the 
following links for more detail about spacemaps and fragmentation: 
http://blog.delphix.com/uday/2013/02/19/78/ and http://serverfault.com/a/556892 
and http://www.mail-archive.com/zfs-discuss@opensolaris.org/msg45408.html 
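
On ZoL this is a module parameter; a hedged example (the single metaslab_debug 
tunable was split into _load/_unload in later ZoL releases, so check what your 
version exposes):

echo 1 > /sys/module/zfs/parameters/metaslab_debug_load     # load all spacemaps at pool import
echo 1 > /sys/module/zfs/parameters/metaslab_debug_unload   # never unload them
# or persistently:
echo "options zfs metaslab_debug_load=1 metaslab_debug_unload=1" >> /etc/modprobe.d/zfs.conf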

There's a lot more to ZFS and things to know than that (L2ARC uses ARC metadata 
space, dedupe uses ARC metadata space, etc.), but as far as CEPH is concerned 
the above is a good place to start. ZFS IMHO is a great solution, but it 
requires some time and effort to do it right.

Cheers,

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: April-15-15 12:22 PM
To: Jake Young
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on Solaris / Illumos

On 04/15/2015 10:36 AM, Jake Young wrote:


 On Wednesday, April 15, 2015, Mark Nelson mnel...@redhat.com 
 mailto:mnel...@redhat.com wrote:



 On 04/15/2015 08:16 AM, Jake Young wrote:

 Has anyone compiled ceph (either osd or client) on a Solaris
 based OS

Re: [ceph-users] full ssd setup preliminary hammer bench

2015-04-17 Thread Michal Kozanecki
Any quick write performance data?

Michal Kozanecki | Linux Administrator | E: mkozane...@evertz.com


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Alexandre DERUMIER
Sent: April-17-15 11:38 AM
To: Mark Nelson; ceph-users
Subject: [ceph-users] full ssd setup preliminary hammer bench

Hi Mark,

I finally got my hardware for my production full ssd cluster.

Here a first preliminary bench. (1osd).

I got around 45K iops with randread 4K with a small 10GB rbd volume


I'm pretty happy because I don't see a huge cpu difference anymore between krbd 
& librbd.
In my previous bench I was using debian wheezy as client, now it's a centos 
7.1, so maybe something is different (glibc,...).

I'm planning to do a big benchmark of centos vs ubuntu vs debian, client & 
server, to compare.
I have 18 osd ssd for the benchmarks.







results : rand 4K : 1 osd
-

fio + librbd: 

iops: 45.1K

clat percentiles (usec):
 |  1.00th=[  358],  5.00th=[  406], 10.00th=[  446], 20.00th=[  556],
 | 30.00th=[  676], 40.00th=[ 1048], 50.00th=[ 1192], 60.00th=[ 1304],
 | 70.00th=[ 1400], 80.00th=[ 1496], 90.00th=[ 1624], 95.00th=[ 1720],
 | 99.00th=[ 1880], 99.50th=[ 1928], 99.90th=[ 2064], 99.95th=[ 2128],
 | 99.99th=[ 2512]

cpu server :  89.1 idle
cpu client :  92.5 idle

fio + krbd
--
iops:47.5K

clat percentiles (usec):
 |  1.00th=[  620],  5.00th=[  636], 10.00th=[  644], 20.00th=[  652],
 | 30.00th=[  668], 40.00th=[  676], 50.00th=[  684], 60.00th=[  692],
 | 70.00th=[  708], 80.00th=[  724], 90.00th=[  756], 95.00th=[  820],
 | 99.00th=[ 1004], 99.50th=[ 1032], 99.90th=[ 1144], 99.95th=[ 1448],
 | 99.99th=[ 2224]

cpu server :  92.4 idle
cpu client :  96.8 idle




hardware (ceph node & client node):
---
ceph : hammer
os : centos 7.1
2 x 10cores Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz 64GB ram
2 x intel s3700 100GB : raid1: os + monitor
6 x intel s3500 160GB : osds
2x10gb mellanox connect-x3 (lacp)

network
---
mellanox sx1012 with breakout cables (10GB)


centos tunning:
---
-noop scheduler
-tune-adm profile latency-performance

ceph.conf
-
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true


osd pool default min size = 1

debug lockdep = 0/0
debug context = 0/0
debug crush = 0/0
debug buffer = 0/0
debug timer = 0/0
debug journaler = 0/0
debug osd = 0/0
debug optracker = 0/0
debug objclass = 0/0
debug filestore = 0/0
debug journal = 0/0
debug ms = 0/0
debug monc = 0/0
debug tp = 0/0
debug auth = 0/0
debug finisher = 0/0
debug heartbeatmap = 0/0
debug perfcounter = 0/0
debug asok = 0/0
debug throttle = 0/0

osd_op_threads = 5
filestore_op_threads = 4


osd_op_num_threads_per_shard = 1
osd_op_num_shards = 10
filestore_fd_cache_size = 64
filestore_fd_cache_shards = 32
ms_nocrc = true
ms_dispatch_throttle_bytes = 0

cephx sign messages = false
cephx require signatures = false

[client]
rbd_cache = false





rand 4K : rbd volume size: 10GB  (data in osd node buffer - no access to disk)
--
fio + librbd

[global]
ioengine=rbd
clientname=admin
pool=pooltest
rbdname=rbdtest
invalidate=0
rw=randread
direct=1
bs=4k
numjobs=2
group_reporting=1
iodepth=32



fio + krbd
---
[global]
ioengine=aio
invalidate=1    # mandatory
rw=randread
bs=4K
direct=1
numjobs=2
group_reporting=1
size=10G

iodepth=32
filename=/dev/rbd0   (noop scheduler)






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2015-04-09 Thread Michal Kozanecki
 pgs inconsistent; 2 scrub errors; noout flag(s) set
 monmap e2: 3 mons at 
{ceph01=10.10.10.101:6789/0,ceph02=10.10.10.102:6789/0,ceph03=10.10.10.103:6789/0},
 election epoch 146, quorum 0,1,2 ceph01,ceph02,ceph03
 osdmap e3178: 3 osds: 3 up, 3 in
flags noout
  pgmap v890949: 392 pgs, 6 pools, 931 GB data, 249 kobjects
            1756 GB used, 704 GB / 2460 GB avail
                   2 active+clean+inconsistent
                 391 active+clean
  client io 0 B/s rd, 7920 B/s wr, 3 op/s



3. Repair must be manually kicked off



[root@client01 ~]# ceph pg repair 5.18
instructing pg 5.18 on osd.0 to repair

[root@client01 ~]# ceph health detail
HEALTH_WARN 1 pgs repair; noout flag(s) set
pg 5.25 is active+clean+inconsistent, acting [1,0,2]
pg 5.18 is active+clean+scrubbing+deep+repair, acting [2,0,1]

/var/log/ceph/ceph-osd.2.log
2015-04-09 13:30:01.609756 7fcbb163a700 -1 log_channel(default) log [ERR] : 
5.18 shard 1: soid cd635018/rbd_data.93d1f74b0dc51.18ee/head//5 
candidate had a read error, digest 1835988768 != known digest 473354757
2015-04-09 13:30:41.834465 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.18 repair 0 missing, 1 inconsistent objects
2015-04-09 13:30:41.834479 7fcbb1e3b700 -1 log_channel(default) log [ERR] : 
5.18 repair 1 errors, 1 fixed

/var/log/ceph/ceph-osd.1.log
2015-04-09 13:30:47.952742 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 shard 1: soid 73eb0125/rbd_data.5315b2ae8944a.5348/head//5 
candidate had a read error, digest 1522345897 != known digest 1180025616
2015-04-09 13:31:23.389095 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 repair 0 missing, 1 inconsistent objects
2015-04-09 13:31:23.389112 7fe10b3c5700 -1 log_channel(default) log [ERR] : 
5.25 repair 1 errors, 1 fixed



Conclusion;

ZFS compression works GREAT, between 30-50% compression depending on data (I 
was getting around 30-35% with only OS images, once I loaded on real test data 
(SVN/GIT/etc) this increased to 50%).

ZFS dedupe doesn't seem to get you much, at least with how CEPH works. Maybe 
due to my recordsize (32K)?

ZFS/CEPH bitrot/corruption protection isn't fully automated but is still pretty 
damn good in my opinion, an improvement over the silent bitrot and coin tossing 
of other filesystems if CEPH somehow detects an error. CEPH attempts to access 
the file, ZFS detects the error and basically kills access to the file. CEPH 
sees this as a read error and kicks off a scrub on the PG. PG repair does not 
seem to happen automatically, however when manually kicked off it succeeds. 

Let me know if there's anything else or any questions people have while I have 
this test cluster running.

Cheers,
Michal Kozanecki | Linux Administrator | mkozane...@evertz.com


-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: November-01-14 4:43 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Fri, 31 Oct 2014 16:32:49 + Michal Kozanecki wrote:

 I'll test this by manually inducing corrupted data to the ZFS 
 filesystem and report back how ZFS+ceph interact during a detected 
 file failure/corruption, how it recovers and any manual steps 
 required, and report back with the results.
 
Looking forward to that.

 As for compression, using lz4 the CPU impact is around 5-20% depending 
 on load, type of I/O and I/O size, with little-to-no I/O performance 
 impact, and in fact in some cases the I/O performance actually 
 increases. I'm currently looking at a compression ratio on the ZFS 
 datasets of around 30-35% for a data consisting of rbd backed 
 OpenStack KVM VMs.

I'm looking at a similar deployment (VM images) and over 30% compression will 
at least negate the need of ZFS to have at least 20% free space or suffer 
massive degradation otherwise.

CPU usage looks acceptable, however in combination with SSD backed OSDs that's 
another thing to consider.
As in, is it worth to spend X amount of money for faster CPUs and 10-20% space 
savings or will be another SSD be cheaper?

I'm trying to position Ceph against SolidFire, who are claiming 4-10 times data 
reduction by a combination of compression, deduping and thin provisioning. 
Without of course quantifying things, like what step gives which reduction 
based on what sample data.

 I have not tried any sort of dedupe as it is memory intensive and I 
 only had 24GB of ram on each node. I'll grab some FIO benchmarks and 
 report back.
 
I foresee a massive failure here, despite a huge potential given one use case 
here where all VMs are basically identical (KSM is very effective with those, 
too).
Why the predicted failure? Several reasons:

1. Deduping is only local, per OSD. 
That will make a big dent, but with many nearly identical VM images we should 
still have a quite a bit of identical data per OSD. However...

2. Data alignment.
The default RADOS objects making up images are 4MB. Which, given my limited

Re: [ceph-users] Power failure recovery woes

2015-02-17 Thread Michal Kozanecki
Hi Jeff,

What type/model of drives are you using as OSDs? Any journals? If so, what 
model? What does your ceph.conf look like? What sort of load is on the cluster 
(if it's still online)? What distro/version? Firewall rules set properly?

Michal Kozanecki


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jeff
Sent: February-17-15 9:17 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Power failure recovery woes

Some additional information/questions:

Here is the output of ceph osd tree

Some of the down OSDs are actually running, but are marked down. For example 
osd.1:

 root 30158  8.6 12.7 1542860 781288 ?  Ssl 07:47   4:40 
/usr/bin/ceph-osd --cluster=ceph -i 0 -f

  Is there any way to get the cluster to recognize them as being up?  
osd.1 has the FAILED assert(last_e.version.version < e.version.version) 
errors.

Thanks,
  Jeff


# id    weight  type name       up/down reweight
-1      10.22   root default
-2      2.72            host ceph1
0       0.91                    osd.0   up      1
1       0.91                    osd.1   down    0
2       0.9                     osd.2   down    0
-3      1.82            host ceph2
3       0.91                    osd.3   down    0
4       0.91                    osd.4   down    0
-4      2.04            host ceph3
5       0.68                    osd.5   up      1
6       0.68                    osd.6   up      1
7       0.68                    osd.7   up      1
8       0.68                    osd.8   down    0
-5      1.82            host ceph4
9       0.91                    osd.9   up      1
10      0.91                    osd.10  down    0
-6      1.82            host ceph5
11      0.91                    osd.11  up      1
12      0.91                    osd.12  up      1

On 2/17/2015 8:28 AM, Jeff wrote:


  Original Message 
 Subject: Re: [ceph-users] Power failure recovery woes
 Date: 2015-02-17 04:23
 From: Udo Lembke ulem...@polarzone.de
 To: Jeff j...@usedmoviefinder.com, ceph-users@lists.ceph.com

 Hi Jeff,
 is the osd /var/lib/ceph/osd/ceph-2 mounted?

 If not, does it help if you mount the osd and start it with service 
 ceph start osd.2 ??

 Udo

 Am 17.02.2015 09:54, schrieb Jeff:
 Hi,

 We had a nasty power failure yesterday and even with UPS's our small 
 (5 node, 12 OSD) cluster is having problems recovering.

 We are running ceph 0.87

 3 of our OSD's are down consistently (others stop and are 
 restartable, but our cluster is so slow that almost everything we do times 
 out).

 We are seeing errors like this on the OSD's that never run:

 ERROR: error converting store /var/lib/ceph/osd/ceph-2: (1) 
 Operation not permitted

 We are seeing errors like these of the OSD's that run some of the time:

 osd/PGLog.cc: 844: FAILED assert(last_e.version.version < 
 e.version.version)
 common/HeartbeatMap.cc: 79: FAILED assert(0 == "hit suicide 
 timeout")

 Does anyone have any suggestions on how to recover our cluster?

 Thanks!
   Jeff


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-31 Thread Michal Kozanecki
I'll test this by manually inducing corrupted data to the ZFS filesystem and 
report back how ZFS+ceph interact during a detected file failure/corruption, 
how it recovers and any manual steps required, and report back with the 
results. 
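
Roughly what I have in mind (pool, device and PG names below are placeholders, 
and this is strictly for the throwaway test cluster): ZFS only detects 
corruption introduced below the filesystem, so the damage has to bypass ZFS, 
e.g. by scribbling on the backing device while the pool is exported:

zpool export osd01
dd if=/dev/urandom of=/dev/sdb bs=1M count=1 seek=2048   # clobber a region of the vdev
zpool import osd01
zpool scrub osd01          # ZFS flags checksum errors on the affected PG object files
ceph pg deep-scrub 5.18    # ceph then sees read errors and marks the PG inconsistent
ceph pg repair 5.18        # repair still has to be kicked off by hand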

As for compression, using lz4 the CPU impact is around 5-20% depending on load, 
type of I/O and I/O size, with little-to-no I/O performance impact; in fact in 
some cases the I/O performance actually increases. I'm currently looking at a 
compression ratio on the ZFS datasets of around 30-35% for data consisting of 
rbd-backed OpenStack KVM VMs. I have not tried any sort of dedupe as it is 
memory intensive and I only had 24GB of ram on each node. I'll grab some FIO 
benchmarks and report back.
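
When I do, it will likely be something simple along these lines (paths and 
parameters are illustrative, not the final test plan):

fio --name=zfs-osd-test --directory=/osd01/fiotest --rw=randwrite --bs=4k \
    --size=2G --numjobs=4 --iodepth=16 --ioengine=libaio --group_reporting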

Cheers,



-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: October-30-14 4:12 AM
To: ceph-users
Cc: Michal Kozanecki
Subject: Re: [ceph-users] use ZFS for OSDs

On Wed, 29 Oct 2014 15:32:57 + Michal Kozanecki wrote:

[snip]
 With Ceph handling the
 redundancy at the OSD level I saw no need for using ZFS mirroring or 
 zraid, instead if ZFS detects corruption instead of self-healing it 
 sends a read failure of the pg file to ceph, and then ceph's scrub 
 mechanisms should then repair/replace the pg file using a good replica 
 elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting 
 match!
 
Could you elaborate on that? 
AFAIK Ceph currently has no way to determine which of the replicas is good, 
one such failed PG object will require you to do a manual repair after the 
scrub and hope that two surviving replicas (assuming a size of
3) are identical. If not, start tossing a coin.
Ideally Ceph would have a way to know what happened (as in, it's a checksum and 
not a real I/O error) and do a rebuild of that object itself.

On an other note, have you done any tests using the ZFS compression?
I'm wondering what the performance impact and efficiency are.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Michal Kozanecki
Forgot to mention, when you create the ZFS/ZPOOL datasets, make sure to set the 
xattr setting to sa

e.g.
  
zpool create osd01 -O xattr=sa -O compression=lz4 sdb

OR if zpool/zfs dataset already created

zfs set xattr=sa osd01

Cheers



-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Michal 
Kozanecki
Sent: October-29-14 11:33 AM
To: Kenneth Waegeman; ceph-users
Subject: Re: [ceph-users] use ZFS for OSDs

Hi Kenneth,

I run a small ceph test cluster using ZoL (ZFS on Linux) ontop of CentOS 7, so 
I'll try and answer any questions. :) 

Yes, ZFS writeparallel support is there, but NOT compiled in by default. You'll 
need to compile it with --with-zlib, but that by itself will fail to compile 
the ZFS support as I found out. You need to ensure you have ZoL installed and 
working, and then pass the location of libzfs to ceph at compile time. 
Personally I just set my environment variables before compiling like so;

ldconfig
export LIBZFS_LIBS=/usr/include/libzfs/
export LIBZFS_CFLAGS=-I/usr/include/libzfs -I/usr/include/libspl

However, the writeparallel performance isn't all that great. The writeparallel 
mode makes heavy use of ZFS's (and BtrFS's for that matter) snapshotting 
capability, and the snap performance on ZoL, at least when I last tested it, is 
pretty terrible. You lose any performance benefits you gain with writeparallel 
to the poor snap performance. 

If you decide that you don't need writeparallel mode, you can use the prebuilt 
packages (or compile with default options) without issue. Ceph (without zlib 
support compiled in) will detect ZFS as a generic/ext4 file system and work 
accordingly. 

As far as performance tweaking, ZIL, write journals and etc, I found that the 
performance difference between using a ZIL vs ceph write journal is about the 
same. I also found that doing both (ZIL AND writejournal) didn't give me much 
of a performance benefit. In my small test cluster I decided after testing to 
forego the ZIL and only use a SSD backed ceph write journal on each OSD, with 
each OSD being a single ZFS dataset/vdev(no zraid or mirroring). With Ceph 
handling the redundancy at the OSD level I saw no need for using ZFS mirroring 
or zraid, instead if ZFS detects corruption instead of self-healing it sends a 
read failure of the pg file to ceph, and then ceph's scrub mechanisms should 
then repair/replace the pg file using a good replica elsewhere on the cluster. 
ZFS + ceph are a beautiful bitrot fighting match!

Let me know if there's anything else I can answer. 

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Kenneth Waegeman
Sent: October-29-14 6:09 AM
To: ceph-users
Subject: [ceph-users] use ZFS for OSDs

Hi,

We are looking to use ZFS for our OSD backend, but I have some questions.

My main question is: does Ceph already support the writeparallel mode for ZFS? 
(as described here:  
http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesting-things-going-on/)
I've found this, but I suppose it is outdated:  
https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

Should Ceph be built with ZFS support? I found a --with-zfslib option 
somewhere, but can someone verify this, or better, has instructions for 
it? :-)

What parameters should be tuned to use this?
I found these :
 filestore zfs_snap = 1
 journal_aio = 0
 journal_dio = 0

Are there other things we need for it?

Many thanks!!
Kenneth

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] use ZFS for OSDs

2014-10-29 Thread Michal Kozanecki
Hi Stijn,

Yes, on my cluster I am running: CentOS 7, ZoL 0.6.3, Ceph 0.80.5.

Cheers


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Stijn 
De Weirdt
Sent: October-29-14 3:49 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] use ZFS for OSDs


hi michal,

thanks for the info. we will certainly try it and see if we come to the same 
conclusions ;)

one small detail: since you were using centos7, i'm assuming you were using ZoL 
0.6.3?

stijn

On 10/29/2014 08:03 PM, Michal Kozanecki wrote:
 Forgot to mention, when you create the ZFS/ZPOOL datasets, make sure 
 to set the xattar setting to sa

 e.g.

 zpool create osd01 -O xattr=sa -O compression=lz4 sdb

 OR if zpool/zfs dataset already created

 zfs set xattr=sa osd01

 Cheers



 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Michal Kozanecki
 Sent: October-29-14 11:33 AM
 To: Kenneth Waegeman; ceph-users
 Subject: Re: [ceph-users] use ZFS for OSDs

 Hi Kenneth,

 I run a small ceph test cluster using ZoL (ZFS on Linux) ontop of 
 CentOS 7, so I'll try and answer any questions. :)

 Yes, ZFS writeparallel support is there, but NOT compiled in by 
 default. You'll need to compile it with --with-zlib, but that by 
 itself will fail to compile the ZFS support as I found out. You need 
 to ensure you have ZoL installed and working, and then pass the 
 location of libzfs to ceph at compile time. Personally I just set my 
 environment variables before compiling like so;

 ldconfig
 export LIBZFS_LIBS=/usr/include/libzfs/
 export LIBZFS_CFLAGS=-I/usr/include/libzfs -I/usr/include/libspl

 However, the writeparallel performance isn't all that great. The 
 writeparallel mode makes heavy use of ZFS's (and BtrFS's for that matter) 
 snapshotting capability, and the snap performance on ZoL, at least when I 
 last tested it, is pretty terrible. You lose any performance benefits you 
 gain with writeparallel to the poor snap performance.

 If you decide that you don't need writeparallel mode you, can use the 
 prebuilt packages (or compile with default options) without issue. Ceph 
 (without zlib support compiled in) will detect ZFS as a generic/ext4 file 
 system and work accordingly.

 As far as performance tweaking, ZIL, write journals and etc, I found that the 
 performance difference between using a ZIL vs ceph write journal is about the 
 same. I also found that doing both (ZIL AND writejournal) didn't give me much 
 of a performance benefit. In my small test cluster I decided after testing to 
 forego the ZIL and only use a SSD backed ceph write journal on each OSD, with 
 each OSD being a single ZFS dataset/vdev(no zraid or mirroring). With Ceph 
 handling the redundancy at the OSD level I saw no need for using ZFS 
 mirroring or zraid, instead if ZFS detects corruption instead of self-healing 
 it sends a read failure of the pg file to ceph, and then ceph's scrub 
 mechanisms should then repair/replace the pg file using a good replica 
 elsewhere on the cluster. ZFS + ceph are a beautiful bitrot fighting match!

 Let me know if there's anything else I can answer.

 Cheers

 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
 Of Kenneth Waegeman
 Sent: October-29-14 6:09 AM
 To: ceph-users
 Subject: [ceph-users] use ZFS for OSDs

 Hi,

 We are looking to use ZFS for our OSD backend, but I have some questions.

 My main question is: Does Ceph already supports the writeparallel mode for 
 ZFS ? (as described here:
 http://www.sebastien-han.fr/blog/2013/12/02/ceph-performance-interesti
 ng-things-going-on/) I've found this, but I suppose it is outdated:
 https://wiki.ceph.com/Planning/Blueprints/Emperor/osd%3A_ceph_on_zfs

 Should Ceph be build with ZFS support? I found a --with-zfslib option 
 somewhere, but can someone verify this, or better has instructions for
 it?:-)

 What parameters should be tuned to use this?
 I found these :
   filestore zfs_snap = 1
   journal_aio = 0
   journal_dio = 0

 Are there other things we need for it?

 Many thanks!!
 Kenneth

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-09 Thread Michal Kozanecki
Network issue maybe? Have you checked your firewall settings? Iptables changed 
a bit in EL7 and might have broken any rules you normally try to use; try 
flushing the rules (iptables -F) and see if that fixes things. If it does, 
you'll then need to fix your firewall rules. 

I ran into a similar issue on EL7 where the OSDs appeared up and in, but were 
stuck in peering, which was due to a few ports being blocked.
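
For reference, with firewalld on EL7 something along these lines should cover 
the defaults (6789/tcp for the mons, 6800-7300/tcp for the OSD port range; 
adjust if you've changed the ms bind ports):

firewall-cmd --permanent --zone=public --add-port=6789/tcp
firewall-cmd --permanent --zone=public --add-port=6800-7300/tcp
firewall-cmd --reload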

Cheers

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of BG
Sent: September-09-14 6:05 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

Loic Dachary loic@... writes:

 
 Hi,
 
 It looks like your osd.0 is down and you only have one osd left 
 (osd.1) which would explain why the cluster cannot get to a healthy 
 state. The size 2 in pool 0 'data' replicated size 2 ... means 
 the pool needs at least two OSDs up to function properly. Do you know why 
 osd.0 is not up?
 
 Cheers
 

I've been trying unsuccessfully to get this up and running since. I've added 
another OSD but still can't get to active + clean state. I'm not even sure if 
the problems I'm having are related to the OS version but I'm running out of 
ideas and unless somebody here can spot something obvious in the logs below I'm 
going to try rolling back to CentOS 6.

$ echo HEALTH && ceph health && echo STATUS && ceph status && echo OSD_DUMP && ceph osd dump
HEALTH
HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
STATUS
cluster f68332e4-1081-47b8-9b22-e5f3dc1f4521
 health HEALTH_WARN 129 pgs peering; 129 pgs stuck unclean
 monmap e1: 1 mons at {hp09=10.119.16.14:6789/0}, election epoch 2, quorum
 0 hp09
 osdmap e43: 3 osds: 3 up, 3 in
  pgmap v61: 192 pgs, 3 pools, 0 bytes data, 0 objects
15469 MB used, 368 GB / 383 GB avail
 129 peering
  63 active+clean
OSD_DUMP
epoch 43
fsid f68332e4-1081-47b8-9b22-e5f3dc1f4521
created 2014-09-09 10:42:35.490711
modified 2014-09-09 10:47:25.077178
flags
pool 0 'data' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool crash_replay_interval 45 stripe_width 0
pool 1 'metadata' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 2 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
max_osd 3
osd.0 up   in  weight 1 up_from 4 up_thru 42 down_at 0 last_clean_interval
[0,0) 10.119.16.14:6800/24988 10.119.16.14:6801/24988 10.119.16.14:6802/24988
10.119.16.14:6803/24988 exists,up 63f3f351-eccc-4a98-8f18-e107bd33f82b
osd.1 up   in  weight 1 up_from 38 up_thru 42 down_at 36 last_clean_interval
[7,37) 10.119.16.15:6800/22999 10.119.16.15:6801/4022999
10.119.16.15:6802/4022999 10.119.16.15:6803/4022999 exists,up
8e1c029d-ebfb-4a8d-b567-ee9cd9ebd876
osd.2 up   in  weight 1 up_from 42 up_thru 42 down_at 40 last_clean_interval
[11,41) 10.119.16.16:6800/25605 10.119.16.16:6805/5025605
10.119.16.16:6806/5025605 10.119.16.16:6807/5025605 exists,up
5d398bba-59f5-41f8-9bd6-aed6a0204656

Sample of warnings from monitor log:
2014-09-09 10:51:10.636325 7f75037d0700  1 mon.hp09@0(leader).osd e72 
prepare_failure osd.1 10.119.16.15:6800/22999 from osd.2
10.119.16.16:6800/25605 is reporting failure:1
2014-09-09 10:51:10.636343 7f75037d0700  0 log [DBG] : osd.1
10.119.16.15:6800/22999 reported failed by osd.2 10.119.16.16:6800/25605

Sample of warnings from osd.2 log:
2014-09-09 10:44:13.723714 7fb828c57700 -1 osd.2 18 heartbeat_check: no reply 
from osd.1 ever on either front or back, first ping sent 2014-09-09
10:43:30.437170 (cutoff 2014-09-09 10:43:53.723713)
2014-09-09 10:44:13.724883 7fb81f2f9700  0 log [WRN] : map e19 wrongly marked 
me down
2014-09-09 10:44:13.726104 7fb81f2f9700  0 osd.2 19 crush map has features 
1107558400, adjusting msgr requires for mons
2014-09-09 10:44:13.726741 7fb811edb700  0 -- 10.119.16.16:0/25605 >> 
10.119.16.15:6806/1022999 pipe(0x3171900 sd=34 :0 s=1 pgs=0 cs=0 l=1 
c=0x3ad8580).fault



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NAS on RBD

2014-09-09 Thread Michal Kozanecki
Hi Blair!

On 9 September 2014 08:47, Blair Bethwaite blair.bethwa...@gmail.com wrote:
 Hi Dan,

 Thanks for sharing!

 On 9 September 2014 20:12, Dan Van Der Ster daniel.vanders...@cern.ch wrote:
 We do this for some small scale NAS use-cases, with ZFS running in a VM with 
 rbd volumes. The performance is not great (especially since we throttle the 
 IOPS of our RBD). We also tried a few kRBD / ZFS servers with an SSD ZIL — 
 the SSD solves any performance problem we ever had with ZFS on RBD.

 That's good to hear. My limited experience doing this on a smaller Ceph 
 cluster (and without any SSD journals or cache devices for ZFS
 head) points to write latency being an immediate issue, decent PCIe SLC SSD 
 devices should pretty much sort that out given the cluster itself has plenty 
 of write throughput available. Then there's further MLC devices for L2ARC - 
 not sure yet but guessing metadata heavy datasets might require 
 primarycache=metadata and rely of L2ARC for data cache. And all this should 
 get better in the medium term with performance improvements and RDMA 
 capability (we're building this with that option in the hole).


I'd love to go back and forth with you privately or on one of the ZFS 
mailing-lists if you want to discuss ZFS tuning in depth, but I want to just 
mention that setting primarycache=metadata will also cause the L2ARC to ONLY 
store and accelerate metadata as well (despite whatever secondarycache is set 
to). I believe this is something that the ZFS developers are looking to improve 
eventually, but as-is, currently that's how it works (the L2ARC only contains 
what was pushed out of the main in-memory ARC). 
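
In other words, with a dataset configured like this (name illustrative) the 
L2ARC device ends up holding metadata only, regardless of the secondarycache 
value:

zfs set primarycache=metadata tank/nfsdata
zfs set secondarycache=all    tank/nfsdata   # still ends up metadata-only in practice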

 I would say though that this setup is rather adventurous. ZoL is not rock 
 solid — we’ve had a few lockups in testing, all of which have been fixed in 
 the latest ZFS code in git (my colleague in CC could elaborate if you’re 
 interested).

 Hmm okay, that's not great. The only problem I've experienced thus far is 
 when the ZoL repos stopped providing DKMS and borked an upgrade for me until 
 I figured out what had happened and cleaned up the old .ko files. So yes, 
 interested to hear elaboration on that.


You mentioned in one of your other emails that if you deployed this idea of a 
ZFS NFS server, you'd do it inside a KVM VM and make use of librbd rather than 
krbd. If you're worried about ZoL stability and feel comfortable going outside 
Linux, you could always go with a *BSD or Illumos distro where ZFS support is 
much more stable/solid. 
In any case I haven't had any major show stopping issues with ZoL myself and I 
use it heavily. Still, unless you're really comfortable with ZoL or 
*BSD/Illumos(as I am), I'd likely recommend looking into other solutions.

 One thing I’m not comfortable with is the idea of ZFS checking the data in 
 addition to Ceph. Sure, ZFS will tell us if there is a checksum error, but 
 without any redundancy at the ZFS layer there will be no way to correct that 
 error. Of course, the hope is that RADOS will ensure 100% data consistency, 
 but what happens if not?...
 
 The ZFS checksumming would tell us if there has been any corruption, which as 
 you've pointed out shouldn't happen anyway on top of Ceph.

Just want to quickly address this, someone correct me if I'm wrong, but IIRC 
even with a replica value of 3 or more, ceph does not (currently) have any 
intelligence when it detects a corrupted/incorrect PG; it will always 
replace/repair the PG with whatever data is in the primary, meaning that if the 
primary PG is the one that's corrupted/bit-rotted/incorrect, it will replace 
the good replicas with the bad.  

 But if we did have some awful disaster scenario where that happened then we'd 
 be restoring from tape, and it'd sure be good to know which files actually 
 needed restoring. I.e., if we lost a single PG at the Ceph level then we 
 don't want to have to blindly restore the whole zpool or dataset.

 Personally, I think you’re very brave to consider running 2PB of ZoL on RBD. 
 If I were you I would seriously evaluate the CephFS option. It used to be on 
 the roadmap for ICE 2.0 coming out this fall, though I noticed its not there 
 anymore (??!!!).

 Yeah, it's very disappointing that this was silently removed. And it's 
 particularly concerning that this happened post RedHat acquisition.
 I'm an ICE customer and sure would have liked some input there for exactly 
 the reason we're discussing.


I'm looking forward to CephFS as well, and I agree, it's somewhat concerning 
that it happened post RedHat acquisition. I'm hoping RedHat pours more 
resources into InkTank and ceph, and not instead leach resources away from them.

 Anyway I would say that ZoL on kRBD is not necessarily a more stable 
 solution than CephFS. Even Gluster striped on top of RBD would probably be 
 more stable than ZoL on RBD.

 If we really have to we'll just run Gluster natively instead (or perhaps XFS 
 on RBD as the option before that) - the hardware