Hi all,
I'm trying to identify the performance bottlenecks in my experimental Ceph
cluster. A little background on my setup:
10 storage servers, each configured with:
-(2) dual-core Opterons
-8 GB of RAM
-(6) 750GB disks (1 OSD per disk, 7200 RPM SATA, probably 4-5 years
old), JBOD w/ BTRFS
-1GbE
-CentOS 6.4, custom kernel 3.7.8
1 dedicated mds/mon server
-same specs as the OSD nodes
(2 more dedicated mons are waiting in the wings; Ceph was recently reinstalled on them)
1 front-facing node mounting CephFS, with a 10GbE connection into the
switch stack housing the storage machines
-CentOS 6.4, custom kernel 3.7.8
Some Ceph settings:
[osd]
osd journal size = 1000
filestore xattr use omap = true
When I transfer files in/out via CephFS (from the 10GbE host), I'm seeing
only about 230 MB/s at peak. First, is this what I should expect? Given 60 OSDs
spread across 10 servers, I would have thought I'd get something closer to
400-500 MB/s or more. I tried upping the number of placement groups to 3000 for
my 'data' pool (following the formula here:
http://ceph.com/docs/master/rados/operations/placement-groups/) with no
increase in performance. I also saw no performance difference between XFS and
BTRFS.
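For reference, here's how I arrived at the PG count (a quick Python sketch of the formula from that doc page; the replica size of 2 is an assumption I haven't re-checked against my actual pools):

```python
import math

def recommended_pgs(num_osds, replicas, pgs_per_osd=100):
    # (OSDs * ~100 PGs per OSD) / replica count, rounded up to the
    # nearest power of two, per the placement-groups doc linked above.
    raw = num_osds * pgs_per_osd / replicas
    return 2 ** math.ceil(math.log2(raw))

print(recommended_pgs(60, 2))  # 60*100/2 = 3000, rounds up to 4096
```

Interestingly, rounding up to a power of two would give 4096 rather than the 3000 I actually set, though I doubt that gap explains the throughput.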
I also see a lot of messages like this in the log:
10.1.6.4:6815/30138 3518 : [WRN] slow request 30.874441 seconds old, received
at 2013-07-31 10:52:49.721518: osd_op(client.7763.1:67060 100000003ba.000013d4
[write 0~4194304] 0.102b9365 RETRY=-1 snapc 1=[] e1454) currently waiting for
subops from [1]
Does anyone have any thoughts as to what the bottleneck may be, if there is
one? Or, any idea what I should try to measure to determine the bottleneck?
Perhaps my disks are just that bad? :)
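For what it's worth, here's the back-of-envelope arithmetic I've been doing (Python sketch; the 2x replication and the journal-on-the-same-disk write doubling are assumptions about my setup that I haven't verified):

```python
def per_disk_write(client_mb_s, replicas, journal_factor, num_osds):
    # Each client byte is written once per replica, and the filestore
    # journal doubles that again if it shares the OSD's disk.
    return client_mb_s * replicas * journal_factor / num_osds

# 230 MB/s observed client throughput, assumed 2x replication,
# 2x journal amplification, 60 OSDs
print(per_disk_write(230, 2, 2, 60))  # ~15.3 MB/s per disk
```

Even old 7200 RPM SATA disks should manage far more than ~15 MB/s sequentially, which makes me suspect either that the writes aren't landing sequentially or that a few slow OSDs (like the one behind the "waiting for subops from [1]" warning) are dragging everything down.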
Cheers,
Lincoln

_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com