Hi Greg,
thank you for your (fast) answer.
Since we're going more in-depth, I must say:
* we're running 2 Gentoo GNU/Linux servers doing both storage and
virtualization (I know this is not recommended but we mostly have a
low load and virtually no writes outside of ceph)
* sys-cluster/ceph-0.56.4 USE="radosgw -debug -fuse -gtk -libatomic
-static-libs -tcmalloc"
* app-emulation/qemu-1.2.2-r3 USE="aio caps curl jpeg ncurses png rbd
sasl seccomp threads uuid vhost-net vnc-alsa -bluetooth -brltty
-debug -doc -fdt -mixemu -opengl -pulseaudio -python -sdl (-selinux)
-smartcard -spice -static -static-softmmu -static-user -systemtap
-tci -tls -usbredir -vde -virtfs -xattr -xen -xfs"
* app-emulation/libvirt-1.0.2-r2 USE="caps iscsi libvirtd lvm lxc
macvtap nls pcap python qemu rbd sasl udev vepa virt-network -audit
-avahi -firewalld -fuse -nfs -numa -openvz -parted -phyp -policykit
(-selinux) -uml -virtualbox -xen"
* 1 SSD, 3 HDDs per host.
* monitor filesystems on SSD
* OSD journals on SSD
* OSD data on spinnies
* our ceph.conf [client] section (see the cache-mode note after this
list):
  [client]
  rbd cache = true
  rbd cache size = 128M
  rbd cache max dirty = 32M
* We can pay for some support if required ;)
* I know Cuttlefish has some scrub-related optimizations, but we cannot
upgrade right now
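About the rbd cache settings above: as far as I understand (and I may
be wrong here), with qemu >= 1.2 the drive's cache mode overrides the
ceph.conf values, so they only take effect when the guest disks are
defined with cache='writeback' in libvirt. A minimal sketch of such a
disk definition (pool, image and monitor names are made up):

  <disk type='network' device='disk'>
    <!-- hypothetical pool/image/monitor names, ours differ -->
    <driver name='qemu' type='raw' cache='writeback'/>
    <source protocol='rbd' name='rbd/some-vm-disk'>
      <host name='mon1.example.net' port='6789'/>
    </source>
    <target dev='vda' bus='virtio'/>
  </disk>

I mention it only because a cache mode of 'none' would silently disable
the rbd cache on those guests.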
On 09/07/2013 13:04, Gregory Farnum wrote:
> What kinds of performance drops are you seeing during recovery?
Mostly high latencies, making some websites unresponsive (LAMP stacks,
mostly). Same thing for some email servers. Another problem is that
munin has trouble fetching its data from the VMs during scrubs (the
munin server is also a VM, and writing at that time is okay).
On a sample host HDD, my latency averages are:

                             Read (ms)  Write (ms)  Util (%)  Read (kB/s)  Write (kB/s)
not scrubbing (07:26-09:58)      10.08      195.41     19.06        80.40       816.84
scrubbing (10:00-11:20)          14.02      198.08     27.73       102.30       797.76
On a sample web and email server:

                             Data coverage (approx.)  Read (ms)  Write (ms)
not scrubbing (07:26-09:58)                     100%      45.02       7.36
scrubbing (10:00-11:20)                       20-30%     432.73     181.19
> If for instance you've got clients sending lots of operations that are small
> compared to object size then the bounding won't work out quite right, or maybe
> you're just knocking out a bunch of servers and getting bad long-tail latency
> effects.
I'm not sure I can answer this. I tend to think it's the first case,
because the drives don't seem to hit even 50% utilization (CPU is
around 3% and I have more than 40 GB of "free" RAM).
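If it helps, I can also capture per-disk numbers live during the next
scrub instead of relying on munin's averaged graphs, with something
like (assuming sysstat is installed):

  # extended per-device stats, refreshed every 5 seconds
  iostat -x 5

and compare await / %util on the OSD spinnies against the journal SSD
while a PG is being scrubbed.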