From: [email protected]
[mailto:[email protected]] On Behalf Of Chen, Xiaoxi
Sent: 25 March 2013 17:02
To: '[email protected]' ([email protected])
Cc: [email protected]
Subject: [ceph-users] Ceph crash at sync_thread_timeout after heavy random writes.
Hi list,
We have hit and reproduced this issue several times: ceph-osd commits suicide
because FileStore::sync_entry() times out after very heavy random IO on top of
RBD.
My test environment is:
4-node Ceph cluster, each node with 20 HDDs for OSDs and 4
Intel DC S3700 SSDs for journals, i.e. 80 spindles in total.
48 VMs spread across 12 physical nodes, with 48 RBDs
attached to the VMs 1:1 via QEMU; QEMU cache is disabled.
Ceph 0.58.
XFS is used as the OSD filesystem.
I am running Aiostress (something like fio) inside the VMs to produce
random write requests on top of each RBD.
From ceph -w, Ceph reports a very high op rate (10,000+ op/s), but
technically 80 spindles can provide at most 150*80/2 = 6000 IOPS for 4K random
writes.
Digging into the code, FileStore.cc::_write() makes it clear
that the OSD opens object files without O_DIRECT, which means data writes are
buffered by the page cache and then return. ::sync_file_range() is called,
but with the flag SYNC_FILE_RANGE_WRITE this system call does not actually sync
data to disk before it returns; it merely initiates the write-out IO.
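To make that distinction concrete, here is a minimal standalone sketch of the syscall semantics involved (this is not the Ceph FileStore code; the file path and write size are made up):

    // Sketch only: shows why buffered writes "complete" long before they are durable.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main() {
        char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open("/tmp/osd-object", O_WRONLY | O_CREAT, 0644);  // no O_DIRECT
        write(fd, buf, sizeof(buf));          // lands in the page cache, returns fast

        // Only *initiates* writeback of this range; returns before data is on disk.
        sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE);

        // Blocks until every dirty page of the whole filesystem is flushed --
        // this is the call that stalls once the dirty-page backlog is large.
        syncfs(fd);

        close(fd);
        return 0;
    }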
So the situation is: since all writes just go to the page cache,
the backend OSD data disk **seems** extremely fast for random writes, which is why we
see such a high op rate from ceph -w. However, when the OSD sync thread tries to
sync the FS it uses ::syncfs(), and before ::syncfs() returns the OS has to ensure
that all dirty pages in the page cache (belonging to that particular FS) have been written
to disk. This obviously takes a long time, since you can only expect about 100 IOPS
from a non-btrfs filesystem. That is where the performance gap lies: an SSD journal can
do 4K random writes at 1K+ IOPS, but the 4 HDDs journaled by the same SSD
can only provide about 400 IOPS.
With the random write pressure continuing, the amount of dirty pages in
the page cache keeps increasing; sooner or later ::syncfs() cannot return
within 600 s (the default value of filestore_commit_timeout), which triggers the
ASSERT that makes the ceph-osd process commit suicide.
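For illustration only, a schematic of that commit-timeout behaviour (a simplified stand-in, not the actual FileStore::sync_entry() code; sync_with_deadline() is a made-up name):

    // Simplified stand-in for the timeout/assert behaviour described above.
    #ifndef _GNU_SOURCE
    #define _GNU_SOURCE
    #endif
    #include <unistd.h>
    #include <chrono>
    #include <cstdlib>
    #include <future>

    void sync_with_deadline(int fd, std::chrono::seconds commit_timeout) {
        // Kick off the (potentially very slow) filesystem-wide flush.
        auto flushed = std::async(std::launch::async, [fd] { syncfs(fd); });

        // If the flush does not finish within filestore_commit_timeout, the real
        // code hits an assert and ceph-osd kills itself; abort() plays that role here.
        if (flushed.wait_for(commit_timeout) != std::future_status::ready)
            std::abort();
    }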
I have tried to reproduce this with rados bench, but failed, because rados bench
**creates** objects rather than modifying them, so a batch of creates can be merged
into a single big write. So I assume that anyone who wants to reproduce this issue
has to use QEMU or the kernel client; using a fast journal (say tmpfs), a slow
data disk, and a small filestore_commit_timeout may help
reproduce this issue in a small-scale environment.
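If it helps, a hypothetical ceph.conf fragment along those lines (the values are illustrative examples, not recommendations):

    [osd]
        # make the flush deadline much shorter than the default 600 s
        filestore commit timeout = 60
        # fast journal on tmpfs, slow data disk behind it
        osd journal = /dev/shm/osd.$id.journal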
Could you please let me know if you need any more information, or whether you
have a solution? Thanks
Xiaoxi