Re: Adding flashcache for data disk to cache Ceph metadata writes

Mark Nelson Tue, 15 Jan 2013 20:40:15 -0800

Hi Xiaoxi,

That's fantastic!  This is really exciting!  Congrats on getting this
working.  Once I have free time again I will try to replicate your
findings.  Again, great job!


Mark

On 01/15/2013 10:00 PM, Chen, Xiaoxi wrote:
> Hi List,
>       I have introduced flashcache (https://github.com/facebook/flashcache),
> aiming to reduce the Ceph metadata I/O that reaches the OSD's disk.
> Basically, for every data write, Ceph needs to write 3 things:
> Pg log
> Pg info
> Actual data
>       The first 2 requests are small, but on a non-btrfs filesystem they
> cause the OSD disk to do 2 seeks, which is critical to a spindle disk's
> throughput, as mentioned in an earlier mail.
> 
>       I list the details of my experiment below; any input is highly appreciated.
> 
>       [Setup]
>       2 hosts, 1 SSD with 1 SATA disk. The SSD is partitioned into 4 partitions:
> P1 as OSD journal, P2 as FlashCache for the SATA disk, P3 as XFS metadata journal.
>       1 client, 1 RBD volume created and mounted
>       [FlashCache setup]
>               [Create cached device]
>                       flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc
>               [Create filesystem]
>                       mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc
>               [Mount]
>                       mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog -o inode64 /dev/mapper/fsdc /data/osd.21/
>               [Tuning]
>                       sysctl dev.flashcache.sda2+sdc.skip_seq_thresh_kb=32
>                       
>                       Since I am aiming to cache only Ceph metadata, and the
> metadata writes are very small, I configured flashcache to skip all
> sequential writes larger than 32K. Basically you can set this to 1K because
> the metadata writes are all less than 1K. I set it to 32K just for a quick
> test.
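
A minimal dry-run sketch of the four setup steps quoted above, parameterized by device names so it can be adapted to another OSD. It only prints commands and runs nothing. The function names are hypothetical, and the sysctl key naming (dev.flashcache.<ssd>+<disk>) is an assumption inferred from the [Tuning] line, so check it against `sysctl -a` on a real system.

```shell
#!/bin/sh
# Dry-run sketch: print the flashcache/XFS setup commands for one OSD.
# Nothing is executed. Function names are illustrative only.

# Build the skip_seq_thresh_kb sysctl key for a cache partition and data disk.
# Assumes the dev.flashcache.<ssd>+<disk> naming seen in the mail.
flashcache_sysctl_key() {
    ssd=$(basename "$1")
    disk=$(basename "$2")
    printf 'dev.flashcache.%s+%s.skip_seq_thresh_kb\n' "$ssd" "$disk"
}

# Print the create, mkfs, mount and tuning commands in order.
# Args: cache name, SSD cache partition, data disk, XFS log partition,
#       mount point, sequential-skip threshold in KB.
print_osd_cache_setup() {
    name=$1; cache_part=$2; data_disk=$3
    log_part=$4; mount_point=$5; thresh_kb=$6
    echo "flashcache_create -v -p back $name $cache_part $data_disk"
    echo "mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=$log_part,size=128m /dev/mapper/$name"
    echo "mount -o logdev=$log_part -o logbsize=256k -o delaylog -o inode64 /dev/mapper/$name $mount_point"
    echo "sysctl -w $(flashcache_sysctl_key "$cache_part" "$data_disk")=$thresh_kb"
}
```

For the setup in this mail the call would be `print_osd_cache_setup fsdc /dev/sda2 /dev/sdc /dev/sda3 /data/osd.21/ 32`. Reviewing the printed commands before running them as root is the point of keeping this a dry run.
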
>       [Experiment]
>               Doing dd from the client on top of the RBD Volume
>       [Result]
>               Throughput boosted from 37MB/s to ~90MB/s. Since flashcache
> works at the DM level, it is transparent to Ceph.
> 
>       This was just a quick test; further tests (including sequential
> R/W and random R/W) are scheduled. I will come back to you when there is
> some progress.
> 
> Xiaoxi
> 
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: January 16, 2013 5:43
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng
> Subject: RE: Seperate metadata disk for OSD
> 
> On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
>> Hi Sage,
>>      FlashCache works well for this scenario. I created a hybrid disk from
>> 1 SSD partition (on the same SSD as the Ceph journal and XFS journal, but a
>> different partition) and 1 SATA disk, and configured FlashCache to ignore all
>> sequential requests larger than 32K (it can be set to a smaller number).
>>      The results show performance comparable to the CephMeta-to-SSD
>> solution.
>>      Since flashcache works in the DM layer, I suppose it's transparent
>> to Ceph, right?
> 
> Right.  That's great to hear that it works well.  If you don't mind, it would 
> be great if you could report the same thing to ceph-devel with a bit of 
> detail about how you configured FlashCache so that others can do the same.
> 
> Thanks!
> sage
> 
> 
>> Xiaoxi
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:[email protected]]
>> Sent: January 15, 2013 2:19
>> To: Chen, Xiaoxi
>> Cc: Mark Nelson; Yan, Zheng ; [email protected]
>> Subject: RE: Seperate metadata disk for OSD
>>
>> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
>>> Hi Sage,
>>>      Thanks for your mail.
>>>     Do you have a timetable for when such an improvement could be
>>> ready? It's critical for non-btrfs filesystems.
>>>     I am thinking about introducing flashcache into my configuration to
>>> cache such metadata writes. Since flashcache works beneath the filesystem,
>>> I suppose it will not break the assumptions inside Ceph. I will try it
>>> tomorrow and come back to you.
>>>     Thanks again for the help!
>>
>> I think avoiding the pginfo change may be pretty simple.  The log one I am a 
>> bit less concerned about (the appends from many rbd IOs will get aggregated 
>> into a small number of XFS IOs), and changing that around would be a bigger 
>> deal.
>>
>> sage
>>
>>
>>> Xiaoxi
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:[email protected]]
>>> Sent: January 13, 2013 0:57
>>> To: Chen, Xiaoxi
>>> Cc: Mark Nelson; Yan, Zheng ; [email protected]
>>> Subject: RE: Seperate metadata disk for OSD
>>>
>>> On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
>>>> Hi Zheng,
>>>>    I have put the XFS log on a separate disk. It does provide some
>>>> performance gain, but not a significant one.
>>>>    Ceph's metadata is somewhat separate (it is a set of files residing on
>>>> the OSD's disk), so it is helped by neither the XFS journal nor the OSD's
>>>> journal. That's why I am trying to put Ceph's metadata (the
>>>> /data/osd.x/meta folder) on a separate SSD.
>>>> To Nelson,
>>>>    I did the experiment with just 1 client; with more clients, the
>>>> gain will not be as large.
>>>>    It looks to me that turning a single client write into 3 disk writes
>>>> is a big overhead for an in-place-update filesystem such as XFS, since it
>>>> introduces more seeks. An out-of-place-update filesystem does not suffer
>>>> much from this pattern; I didn't see this problem when using BTRFS as the
>>>> backend filesystem. But for BTRFS, fragmentation is another performance
>>>> killer: for a single RBD volume, after a lot of random writes, the
>>>> sequential read performance drops to 30% of that of a new RBD volume.
>>>> This makes BTRFS unusable in production.
>>>>    Separating the Ceph metadata seems quite easy to me (I just mount a
>>>> partition at /data/osd.X/meta). Is that right? Is there any potential
>>>> problem with it?
>>>
>>> Unfortunately, yes.  The ceph journal and fs sync are carefully timed.
>>> The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq 
>>> file will sync everything, but if meta/ is another fs that isn't true.  At 
>>> the very least, the code needs to be modified to sync that as well.
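
A minimal sketch for checking whether an OSD's meta directory sits on the same filesystem as the rest of the OSD data, which is the condition the syncfs(2) assumption above depends on. The function names are hypothetical, the current/meta layout follows the paths shown in this thread's logs, and `stat -c '%d'` is the GNU coreutils form (BSD stat uses different flags).

```shell
#!/bin/sh
# Succeed when both paths live on the same filesystem (same device ID).
# Returns 2 if stat fails, so a stat error is never mistaken for a match.
same_filesystem() {
    dev_a=$(stat -c '%d' "$1") || return 2
    dev_b=$(stat -c '%d' "$2") || return 2
    [ "$dev_a" = "$dev_b" ]
}

# Report whether a single syncfs(2) on the OSD data filesystem would
# also cover meta/. Arg: the OSD data directory, e.g. /data/osd.21
check_osd_meta() {
    osd_data=$1
    if same_filesystem "$osd_data/current" "$osd_data/current/meta"; then
        echo "ok: meta shares a filesystem with current"
    else
        echo "warning: meta is on a separate filesystem"
    fi
}
```

Running `check_osd_meta /data/osd.21` after mounting a partition over the meta directory would print the warning, which matches the case described above where the sync no longer covers everything.
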
>>>
>>> That said, there is a lot of improvement that can be had here.  The three 
>>> things we write are:
>>>
>>>   the pg log
>>>   the pg info, spread across the pg dir xattr and that pginfo file
>>>   the actual io
>>>
>>> The pg log could go in leveldb, which would translate those writes into a 
>>> single sequential stream across the entire OSD.  And splitting the PG info 
>>> between the xattr and the file is far from optimal: most of that data 
>>> doesn't actually change on each write.  What little does is very small, and 
>>> could be moved into the xattr, avoiding touching the file (which means an 
>>> inode + data block write) at all.
>>>
>>> We need to look a bit more closely to see how difficult that will really be 
>>> to implement, but I think it is promising!
>>>
>>> sage
>>>
>>>
>>>>
>>>> Xiaoxi
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:[email protected]]
>>>> Sent: January 12, 2013 21:36
>>>> To: Yan, Zheng
>>>> Cc: Chen, Xiaoxi; [email protected]
>>>> Subject: Re: Seperate metadata disk for OSD
>>>>
>>>> Hi Xiaoxi and Zheng,
>>>>
>>>> We've played with both of these some internally, but not for a production 
>>>> deployment. Mostly just for diagnosing performance problems.
>>>>    It's been a while since I last played with this, but I hadn't seen a 
>>>> whole lot of performance improvements at the time.  That may have been due 
>>>> to the hardware in use, or perhaps other parts of Ceph have improved to 
>>>> the point where this matters now!
>>>>
>>>> On a side note, Btrfs also had a google summer of code project to let you 
>>>> put metadata on an external device.  Originally I think that was supposed 
>>>> to make it into 3.7, but am not sure if that happened.
>>>>
>>>> Mark
>>>>
>>>> On 01/12/2013 06:21 AM, Yan, Zheng wrote:
>>>>> On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <[email protected]> 
>>>>> wrote:
>>>>>>
>>>>>> Hi list,
>>>>>>           For an rbd write request, Ceph needs to do 3 writes:
>>>>>> 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790
>>>>>> 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
>>>>>> 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
>>>>>> 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
>>>>>> 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
>>>>>> 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d708
>>>>>> 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288
>>>>>> 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__head_8ABF341A__2
>>>>>>           When using XFS as the backend filesystem on top of a
>>>>>> traditional SATA disk, this introduces a lot of seeks and therefore
>>>>>> reduces bandwidth. A blktrace is available here
>>>>>> (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate
>>>>>> this issue (single client running dd on top of a new RBD volume).
>>>>>>           Then I tried moving /osd.X/current/meta to a separate disk, and
>>>>>> the bandwidth increased (see the blktrace at
>>>>>> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
>>>>>>           I haven't tested other access patterns or anything else, but it
>>>>>> looks to me that moving such metadata to a separate disk (SSD, or SATA
>>>>>> with btrfs) will benefit Ceph write performance. Is that true? Will
>>>>>> Ceph introduce this feature in the future? Is there any potential
>>>>>> problem with such a hack?
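
A minimal sketch that totals the metadata and data bytes from filestore debug lines like the ones quoted above, to show how small the two metadata writes are next to the data write they accompany. The function name and the `sample_log` variable are hypothetical. The sketch assumes each write entry is on a single line ending in `<offset>~<length>`, so log output wrapped by a mail client would need rejoining first.

```shell
#!/bin/sh
# Sum bytes written to meta/ objects and to data objects, reading
# filestore "write <object> <offset>~<length>" debug lines on stdin.
summarize_filestore_writes() {
    awk '
        / write / {
            # The extent is the last field, formatted offset~length.
            n = split($NF, extent, "~")
            if (n != 2) next
            # The object name is the field right after the word "write".
            for (i = 1; i < NF; i++) if ($i == "write") obj = $(i + 1)
            if (obj ~ /^meta\//) { meta_bytes += extent[2]; meta_writes++ }
            else                 { data_bytes += extent[2]; data_writes++ }
        }
        END {
            printf "meta: %d writes, %d bytes\n", meta_writes, meta_bytes
            printf "data: %d writes, %d bytes\n", data_writes, data_bytes
        }
    '
}

# The three write entries from the log in this mail, one entry per line.
sample_log='2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288'
```

Feeding the sample through the function with `printf '%s\n' "$sample_log" | summarize_filestore_writes` reports 2 metadata writes totalling 643 bytes against 1 data write of 524288 bytes, so the cost being discussed is the extra seeks, not the volume of metadata.
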
>>>>>>
>>>>>
>>>>> Did you try putting the XFS metadata log on a separate, fast device
>>>>> (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)? I think it will
>>>>> boost performance too.
>>>>>
>>>>> Regards
>>>>> Yan, Zheng
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to [email protected] More
>>>>> majordomo info at  http://vger.kernel.org/majordomo-info.html
>>>>>
>>>>

