Adding flashcache for data disk to cache Ceph metadata writes

Chen, Xiaoxi Tue, 15 Jan 2013 20:01:32 -0800

Hi List,
        I have introduced flashcache (https://github.com/facebook/flashcache)
to reduce the Ceph metadata I/Os that reach the OSD's disk. Basically, for every
data write, Ceph needs to write 3 things:
        PG log
        PG info
        Actual data
        The first 2 writes are small, but on a non-btrfs filesystem they
cost the OSD disk 2 extra seeks, which is critical to a spindle disk's
throughput, as mentioned in my earlier mail.


        The details of my experiment are listed below; any input is highly appreciated.

        [Setup]
        2 hosts, 1 SSD plus 1 SATA disk. The SSD is partitioned into 4 partitions:
P1 as the OSD journal, P2 as the flashcache device for the SATA disk, and P3 as
the XFS metadata journal.
        1 client, with 1 RBD volume created and mounted.
        [FlashCache setup]
                [Create cached device]
                        flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc
                [Create filesystem]
                        mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc
                [Mount]
                        mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog -o inode64 /dev/mapper/fsdc /data/osd.21/
                [Tuning]
                        sysctl dev.flashcache.sda2+sdc.skip_seq_thresh_kb=32
                        
                        Since I am aiming to cache only Ceph metadata, and the
metadata writes are very small, I configured flashcache to skip all
sequential writes larger than 32K. You could set this as low as 1K, because the
metadata writes are all smaller than 1K; I set it to 32K just for a
quick test.
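
A minimal sketch of the tuning step, assuming the per-cache sysctl prefix is dev.flashcache.<ssd_dev>+<disk_dev>, built from the basenames of the two devices passed to flashcache_create above (/dev/sda2 and /dev/sdc). The helper names flashcache_key and set_seq_skip are hypothetical, made up for this illustration. By default the script only prints the sysctl command; it runs it only when DRY_RUN=0.

```shell
#!/bin/sh
# Build the per-cache sysctl prefix from the SSD partition and backing disk.
# Assumption: the key is "<ssd basename>+<disk basename>".
flashcache_key() {
    ssd=$(basename "$1")
    disk=$(basename "$2")
    printf 'dev.flashcache.%s+%s\n' "$ssd" "$disk"
}

# Print (or, with DRY_RUN=0, run) the sysctl that tells flashcache to skip
# sequential I/O above the given threshold in KB.
set_seq_skip() {
    key="$(flashcache_key "$1" "$2").skip_seq_thresh_kb=$3"
    if [ "${DRY_RUN:-1}" = "0" ]; then
        sysctl -w "$key"
    else
        echo "sysctl -w $key"
    fi
}

# Dry run with the devices and the 32K threshold used in this mail.
set_seq_skip /dev/sda2 /dev/sdc 32
```

Printing first makes it easy to check the generated key against the entries under /proc/sys/dev/flashcache/ before changing anything.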
        [Experiment]
                Run dd from the client on top of the RBD volume.
        [Result]
                Throughput rose from 37MB/s to ~90MB/s. Since flashcache
works at the DM level, it is transparent to Ceph.
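
The exact dd invocation used in the experiment is not given in this mail. The following is only a sketch of a comparable sequential write test; the function name seq_write_test, the block size, the block count, and the example path are all assumptions.

```shell
#!/bin/sh
# Sequential write test: write COUNT blocks of size BS to TARGET.
# conv=fsync makes dd flush the data to the device before it exits, so the
# rate dd reports is not just page-cache speed.
seq_write_test() {
    target=$1
    bs=${2:-4M}
    count=${3:-256}
    dd if=/dev/zero of="$target" bs="$bs" count="$count" conv=fsync
}

# Example call (hypothetical path on the mounted RBD volume):
# seq_write_test /mnt/rbd/ddtest 4M 256
```

Running the same call before and after enabling the cache, on a freshly created volume each time, keeps the two numbers comparable.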

        This was just a quick test; further tests (including sequential
R/W and random R/W) are scheduled. I will come back to you when there is
progress.

        Xiaoxi

-----Original Message-----
From: Sage Weil [mailto:[email protected]] 
Sent: January 16, 2013 5:43
To: Chen, Xiaoxi
Cc: Mark Nelson; Yan, Zheng 
Subject: RE: Seperate metadata disk for OSD

On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
> Hi Sage,
>       FlashCache works well for this scenario. I created a hybrid disk with
> 1 SSD partition (on the same SSD as the Ceph journal and XFS journal, but in
> a different partition) and 1 SATA disk. I configured FlashCache to ignore all
> sequential requests larger than 32K (it can be set to a smaller number).
>       The results show performance comparable to the CephMeta-to-SSD
> solution.
>       Since flashcache works in the DM layer, I suppose it's transparent
> to Ceph, right?

Right.  That's great to hear that it works well.  If you don't mind, it would 
be great if you could report the same thing to ceph-devel with a bit of detail 
about how you configured FlashCache so that others can do the same.

Thanks!
sage


> Xiaoxi
> 
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: January 15, 2013 2:19
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng ; [email protected]
> Subject: RE: Seperate metadata disk for OSD
> 
> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
> > Hi Sage,
> >     Thanks for your mail.
> >     Do you have a timetable for when such an improvement could be
> > ready? It's critical for non-btrfs filesystems.
> >     I am thinking about introducing flashcache into my configuration to
> > cache such metadata writes. Since flashcache works underneath the
> > filesystem, I suppose it will not break the assumptions inside Ceph. I will
> > try it tomorrow and come back to you.
> >     Thanks again for the help!
> 
> I think avoiding the pginfo change may be pretty simple.  The log one I am a 
> bit less concerned about (the appends from many rbd IOs will get aggregated 
> into a small number of XFS IOs), and changing that around would be a bigger 
> deal.
> 
> sage
> 
> 
> > Xiaoxi
> > 
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: January 13, 2013 0:57
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; Yan, Zheng ; [email protected]
> > Subject: RE: Seperate metadata disk for OSD
> > 
> > On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
> > > Hi Zheng,
> > >   I have put the XFS log on a separate disk. It does provide some
> > > performance gain, but not a significant one.
> > >   Ceph's metadata is separate from both journals (it is a set of files
> > > residing on the OSD's disk), so it is helped by neither the XFS journal
> > > log nor the OSD's journal. That's why I am trying to put Ceph's metadata
> > > (the /data/osd.x/meta folder) on a separate SSD.
> > > To Nelson,
> > >   I did the experiment with just 1 client; with more clients, the
> > > gain will not be that large.
> > >   It looks to me that turning a single client-side write into 3 writes to
> > > disk is a big overhead for an in-place-update filesystem such as
> > > XFS, since it introduces more seeks. An out-of-place-update filesystem
> > > does not suffer much from this pattern; I didn't see this problem when
> > > using BTRFS as the backend filesystem. But for BTRFS, fragmentation is
> > > another performance killer: for a single RBD volume, after a lot of random
> > > writes to it, sequential read performance drops to 30% of that of a new
> > > RBD volume. This makes BTRFS unusable in production.
> > >   Separating the Ceph metadata seems quite easy to me (I just mount a
> > > partition at /data/osd.X/meta). Is that right? Is there any potential
> > > problem with it?
> > 
> > Unfortunately, yes.  The ceph journal and fs sync are carefully timed.
> > The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq
> > file will sync everything, but if meta/ is another fs that isn't true.  At
> > the very least, the code needs to be modified to sync that as well.
> > 
> > That said, there is a lot of improvement that can be had here.  The three 
> > things we write are:
> > 
> >  the pg log
> >  the pg info, spread across the pg dir xattr and that pginfo file
> >  the actual io
> > 
> > The pg log could go in leveldb, which would translate those writes into a 
> > single sequential stream across the entire OSD.  And the PG info split
> > between the xattr and the file is far from optimal: most of that data 
> > doesn't actually change on each write.  What little does is very small, and 
> > could be moved into the xattr, avoiding touching the file (which means an 
> > inode + data block write) at all.
> > 
> > We need to look a bit more closely to see how difficult that will really be 
> > to implement, but I think it is promising!
> > 
> > sage
> > 
> > 
> > > 
> > > Xiaoxi
> > >
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:[email protected]]
> > > Sent: January 12, 2013 21:36
> > > To: Yan, Zheng
> > > Cc: Chen, Xiaoxi; [email protected]
> > > Subject: Re: Seperate metadata disk for OSD
> > > 
> > > Hi Xiaoxi and Zheng,
> > > 
> > > We've played with both of these some internally, but not for a production 
> > > deployment. Mostly just for diagnosing performance problems. 
> > >   It's been a while since I last played with this, but I hadn't seen a 
> > > whole lot of performance improvements at the time.  That may have been 
> > > due to the hardware in use, or perhaps other parts of Ceph have improved 
> > > to the point where this matters now!
> > > 
> > > On a side note, Btrfs also had a google summer of code project to let you 
> > > put metadata on an external device.  Originally I think that was supposed 
> > > to make it into 3.7, but am not sure if that happened.
> > > 
> > > Mark
> > > 
> > > On 01/12/2013 06:21 AM, Yan, Zheng wrote:
> > > > On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <[email protected]> 
> > > > wrote:
> > > >>
> > > >> Hi list,
> > > >>          For a rbd write request, Ceph need to do 3 writes:
> > > >> 2013-01-10 13:10:15.539967 7f52f516c700 10
> > > >>filestore(/data/osd.21) _do_transaction on 0x327d790
> > > >> 2013-01-10 13:10:15.539979 7f52f516c700 15
> > > >>filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1
> > > >>36015~147
> > > >> 2013-01-10 13:10:15.540016 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path: 
> > > >>/data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
> > > >> 2013-01-10 13:10:15.540164 7f52f516c700 15
> > > >>filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1
> > > >>0~496
> > > >> 2013-01-10 13:10:15.540189 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path: 
> > > >>/data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
> > > >> 2013-01-10 13:10:15.540217 7f52f516c700 10
> > > >>filestore(/data/osd.21) _do_transaction on 0x327d708
> > > >> 2013-01-10 13:10:15.540222 7f52f516c700 15
> > > >>filestore(/data/osd.21) write
> > > >>2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2
> > > >>3227648~524288
> > > >> 2013-01-10 13:10:15.540245 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path: 
> > > >>/data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__
> > > >>he
> > > >>ad
> > > >>_8
> > > >>ABF341A__2
> > > >>l
> > > >>          If using XFS as backend file system and running xfs on top of 
> > > >> traditional sata disk, it will introduce a lot of seeks and therefore 
> > > >> reduce bandwidth, a blktrace is available here :( 
> > > >> http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate 
> > > >> this issue.( single client running dd on top of a new RBD volumes).
> > > >>          Then I tried to move /osd.X/current/meta to a separate disk, 
> > > >> the bandwidth boosted.(look blktrace at 
> > > >> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
> > > >>          I haven't test other access pattern or something else, but it 
> > > >> looks to me that moving such meta to a separate disk (ssd or sata with 
> > > >> btrfs) will benefit ceph write performance, is it true? Will ceph 
> > > >> introduce this feature in the future?  Is there any potential problem 
> > > >> for such hack?
> > > >>
> > > >
> > > > > Did you try putting the XFS metadata log on a separate, fast device
> > > > > (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)? I think it will
> > > > > boost performance too.
> > > >
> > > > Regards
> > > > Yan, Zheng
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> > > > in the body of a message to [email protected] More 
> > > > majordomo info at  http://vger.kernel.org/majordomo-info.html
> > > >
> > > 
