Hi, folks.
I'm running a series of experiments and tests with CephFS and have run into a
behavior that I can't seem to control.
I set up a 5-node Ceph cluster on enterprise servers. Each server has
10 x 6TB HDDs and 2 x 800GB SSDs. I configured the two SSDs as a RAID-1 device
for journaling, and also set aside two of the HDDs for the same purpose, for
comparison. The remaining 8 HDDs are configured as OSDs. The servers have 196GB
of RAM, and our private (cluster) network is backed by a 40Gb/s Brocade switch
(the frontend network is 10Gb/s).
When benchmarking the HDDs directly, here's the performance I get:
dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=10G count=1 oflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 11.684 s, 184 MB/s
For read performance:
dd if=/var/lib/ceph/osd/ceph-0/deleteme of=/dev/null bs=10G count=1 iflag=direct &
0+1 records in
0+1 records out
2147479552 bytes (2.1 GB) copied, 8.30168 s, 259 MB/s
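(As an aside: the "0+1 records" lines show that dd actually transferred only
~2.1 GB, not 10 GB. As far as I know, Linux caps a single read()/write() call
at 2,147,479,552 bytes, so bs=10G count=1 gets silently truncated. If that's
right, a block size below the cap should exercise the full 10 GB, e.g.:

    dd if=/dev/zero of=/var/lib/ceph/osd/ceph-0/deleteme bs=1G count=10 oflag=direct

The throughput figure shouldn't change much, but the run covers all the data.)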
Now, when I benchmark the OSDs configured with HDD-based journaling, here's
what I get:
[root@cephnode1 ceph-cluster]# ceph tell osd.1 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": 426840870.000000}
which looks coherent. If I switch to the SSD-based journal, here's the new
figure:
[root@cephnode1 ~]# ceph tell osd.1 bench
{ "bytes_written": 1073741824,
  "blocksize": 4194304,
  "bytes_per_sec": 805229549.000000}
which, again, looks as expected to me.
Finally, when I run the rados bench, here's what I get:
rados bench -p cephfs_data 300 write --no-cleanup && rados bench -p cephfs_data 300 seq
Total time run:         300.345098
Total writes made:      48327
Write size:             4194304
Bandwidth (MB/sec):     643.620
Stddev Bandwidth:       114.222
Max bandwidth (MB/sec): 1196
Min bandwidth (MB/sec): 0
Average Latency:        0.0994289
Stddev Latency:         0.112926
Max latency:            1.85983
Min latency:            0.0139412
----------------------------------------
Total time run:       300.121930
Total reads made:     31990
Read size:            4194304
Bandwidth (MB/sec):   426.360
Average Latency:      0.149346
Max latency:          1.77489
Min latency:          0.00382452
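(Side note: I ran the write phase with --no-cleanup so the seq pass would have
data to read, which means the benchmark objects stay in the pool afterwards;
if I read the man page right, they can be removed with:

    rados -p cephfs_data cleanup
)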
I configured the cluster to keep 3 copies of the data (i.e., each object is
replicated twice), so these numbers fall within my expectations. So far so
good, but here comes the issue: I configured CephFS and mounted a share locally
on one of my servers. When I write data to it, throughput is abnormally high
for about the first 5 seconds, stalls for about 20 seconds, and then picks up
again. For long-running tests, the observed write throughput is very close to
what the rados bench reported (about 640 MB/s), but for short-lived tests I see
peaks of over 5 GB/s. I know that journaling is expected to cause spiky
performance patterns like that, but not to this degree, which makes me think
that CephFS is buffering my writes and returning control to the client before
persisting them to the journal, which looks undesirable.
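(For anyone trying to reproduce this: a dd run that forces a flush before
reporting, or that bypasses the client page cache entirely, should make the
buffering visible. conv=fdatasync and oflag=direct are standard dd flags;
/mnt/cephfs and the file name are just placeholders for my mount point:

    # report the rate only after the data has been flushed out
    dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 conv=fdatasync
    # bypass the client page cache altogether
    dd if=/dev/zero of=/mnt/cephfs/deleteme bs=4M count=2560 oflag=direct

With oflag=direct I'd expect the 5 GB/s peaks to disappear if the page cache is
indeed what's absorbing the writes.)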
I searched the web for a couple of days looking for ways to disable this
apparent write buffering, but couldn't find anything. So here comes my
question: how can I disable it?
Thanks and regards,
F. Lucchese