
From: [email protected] 
[mailto:[email protected]] On Behalf Of Chen, Xiaoxi
Sent: March 25, 2013 17:02
To: '[email protected]' ([email protected])
Cc: [email protected]
Subject: [ceph-users] Ceph crash at sync_thread_timeout after heavy random 
writes.

Hi list,
         We have hit and reproduced this issue several times: ceph-osd commits 
suicide because "FileStore: sync_entry timed out" after very heavy random IO on 
top of RBD.
         My test environment is:
                            4-node ceph cluster with 20 HDDs for OSDs and 4 
Intel DC S3700 SSDs for journals per node, i.e. 80 spindles in total
                            48 VMs spread across 12 physical nodes, with 48 
RBDs attached to the VMs 1:1 via QEMU, QEMU cache disabled
                            Ceph @ 0.58
                            XFS as the OSD filesystem
         I am running aio-stress (something like fio) inside the VMs to 
generate random write requests on top of each RBD.

         From ceph -w, ceph reports a very high op rate (10,000+ ops/s), but in 
theory 80 spindles can only provide up to 150*80/2 = 6,000 IOPS for 4K random 
writes (150 IOPS per spindle times 80 spindles, halved for the two data 
copies).
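         To make the gap concrete, here is a rough back-of-the-envelope sketch 
(the per-spindle figure and the 2x factor for replication are assumptions taken 
from the numbers above, not measurements):

    // Rough capacity math, values taken/assumed from the numbers in this mail.
    #include <cstdio>

    int main() {
        const int spindles = 80;     // 20 HDDs per node * 4 nodes
        const int hdd_iops = 150;    // assumed 4K random write IOPS per HDD
        const int replicas = 2;      // assumed replication factor

        const int sustainable = spindles * hdd_iops / replicas;  // = 6000
        const int observed    = 10000;                           // from ceph -w

        std::printf("sustainable client IOPS: %d\n", sustainable);
        std::printf("observed client IOPS:    %d\n", observed);
        std::printf("excess going to dirty page cache: ~%d writes/s\n",
                    observed - sustainable);
        return 0;
    }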
         Digging into the code, from FileStore.cc::_write() it is clear that 
the OSD opens object files without O_DIRECT, which means data writes are 
buffered by the page cache and then return. ::sync_file_range() is called, but 
with the flag SYNC_FILE_RANGE_WRITE this system call does not actually sync 
data to disk before it returns; it only initiates the write-out IOs.
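         As a standalone illustration of that pattern (this is not the 
FileStore code, just a minimal C/C++ sketch of the syscalls involved):

    /* Buffered write + asynchronous flush vs. a full filesystem sync. */
    #define _GNU_SOURCE            /* for sync_file_range() and syncfs() */
    #include <fcntl.h>
    #include <string.h>
    #include <unistd.h>

    int main(void)
    {
        char buf[4096];
        memset(buf, 0xab, sizeof(buf));

        /* Opened without O_DIRECT, so the write only lands in the page cache. */
        int fd = open("object.data", O_WRONLY | O_CREAT, 0644);
        if (fd < 0)
            return 1;
        pwrite(fd, buf, sizeof(buf), 0);

        /* Only *initiates* write-out of the range; returns without waiting
         * for the data to reach disk. */
        sync_file_range(fd, 0, sizeof(buf), SYNC_FILE_RANGE_WRITE);

        /* By contrast, this blocks until every dirty page of the whole
         * filesystem containing fd has been written back. */
        syncfs(fd);

        close(fd);
        return 0;
    }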
                    So the situation is: since all writes just go to the page 
cache, the backend OSD data disks **seem** extremely fast for random writes, 
which is why we see such a high op rate from ceph -w. However, when the OSD 
sync thread tries to sync the FS it uses ::syncfs(), and before ::syncfs() 
returns the OS has to ensure that all dirty pages in the page cache (belonging 
to that particular FS) have been written to disk. This obviously takes a long 
time, since you can only expect on the order of 100 IOPS per spindle on a 
non-btrfs filesystem. That is where the performance gap lies: an SSD journal 
can do 4K random writes at 1K+ IOPS, but the 4 HDDs journaled by that same SSD 
can only provide about 400 IOPS.
With the random write pressure continuing, the amount of dirty pages in the 
page cache keeps growing; sooner or later ::syncfs() cannot return within 600s 
(the default value of filestore_commit_timeout), which triggers the ASSERT that 
makes the ceph-osd process commit suicide.
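         For anyone unfamiliar with the mechanism, here is a simplified sketch 
of the sync-thread-with-watchdog pattern (not Ceph's actual code; the path and 
timing details are only illustrative):

    // A flush thread guarded by a commit timeout: if syncfs() does not
    // finish in time, the watchdog asserts and the process dies.
    #include <cassert>
    #include <chrono>
    #include <condition_variable>
    #include <fcntl.h>
    #include <mutex>
    #include <thread>
    #include <unistd.h>

    int main()
    {
        const auto commit_timeout = std::chrono::seconds(600); // filestore_commit_timeout

        // Any fd on the OSD data filesystem (path is illustrative only).
        int fd = open("/var/lib/ceph/osd", O_RDONLY);
        if (fd < 0)
            return 1;

        std::mutex m;
        std::condition_variable cv;
        bool sync_done = false;

        std::thread sync_thread([&] {
            syncfs(fd);   // blocks until all dirty pages of that FS are on disk
            std::lock_guard<std::mutex> lk(m);
            sync_done = true;
            cv.notify_one();
        });

        {
            std::unique_lock<std::mutex> lk(m);
            // If syncfs() has not returned within the timeout, this fires and
            // the process aborts -- the equivalent of the OSD suicide above.
            bool finished = cv.wait_for(lk, commit_timeout,
                                        [&] { return sync_done; });
            assert(finished && "FileStore: sync_entry timed out");
        }

        sync_thread.join();
        close(fd);
        return 0;
    }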

   I have tried to reproduce this with rados bench, but failed, because rados 
bench **creates** objects rather than modifying them, so a batch of creates can 
be merged into a single big write. So I assume that anyone who would like to 
reproduce this issue has to use the QEMU or kernel client, with a fast journal 
(say, tmpfs) and slow data disks; choosing a small filestore_commit_timeout may 
also help to reproduce the issue in a small-scale environment (see the sketch 
below).
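         For reference, the knobs I would tweak for such a small-scale 
reproduction look roughly like this in ceph.conf (paths and values are only 
illustrative):

    [osd]
        osd journal = /dev/shm/osd.$id.journal    ; fast, tmpfs-backed journal
        osd journal size = 1024                   ; MB
        filestore commit timeout = 60             ; default is 600 seconds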

         Could you please let me know if you need any more information, or 
whether you have a solution in mind? Thanks.

            Xiaoxi
