Hi List,
I have introduced flashcache (https://github.com/facebook/flashcache),
aiming to reduce Ceph metadata IOs to the OSD's disk. Basically, for every data
write, Ceph needs to write 3 things:
Pg log
Pg info
Actual data
The first 2 requests are small, but on a non-btrfs filesystem these 2
writes cause the OSD disk to do 2 seeks, which is critical to a spindle disk's
throughput, as mentioned in an earlier mail.
The details of my experiment are listed below; any input is highly appreciated.
[Setup]
2 hosts, 1 SSD with 1 SATA disk; the SSD is partitioned into 4 partitions.
P1 is the OSD journal, P2 is the FlashCache device for the SATA disk, P3 is the XFS metadata journal.
1 client, with 1 RBD volume created and mounted.
[FlashCache setup]
[Create cached device]
flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc
[Create filesystem]
mkfs.xfs -f -i size=2048 -d agcount=1 \
    -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc
[Mount]
mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog \
    -o inode64 /dev/mapper/fsdc /data/osd.21/
[Tuning]
sysctl dev.flashcache.sda2+sdc.skip_seq_thresh_kb=32
Since I am aiming to cache only Ceph metadata, and the
metadata writes are very small, I configured flashcache to skip all
sequential writes larger than 32K. You could set this as low as 1K, because the
metadata writes are all smaller than 1K. I set it to 32K just for a
quick test.
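A minimal sketch of how the tuning command can be derived from the two devices passed to flashcache_create above. It assumes the sysctl key has the form dev.flashcache.<ssd>+<disk>.<tunable>; the helper names flashcache_key and skip_seq_cmd are made up for illustration, and the script only prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: build the flashcache sysctl key for a cache made of an SSD partition
# and a backing disk, then print (not run) the tuning command.
# Assumption: the key has the form dev.flashcache.<ssd>+<disk>.<tunable>.

flashcache_key() {
    # $1 = SSD partition path, $2 = backing disk path, $3 = tunable name
    ssd=$(basename "$1")
    disk=$(basename "$2")
    printf 'dev.flashcache.%s+%s.%s\n' "$ssd" "$disk" "$3"
}

skip_seq_cmd() {
    # $1 = SSD partition, $2 = backing disk, $3 = threshold in KB
    printf 'sysctl -w %s=%s\n' \
        "$(flashcache_key "$1" "$2" skip_seq_thresh_kb)" "$3"
}

# Dry run for the devices given to flashcache_create in this setup.
skip_seq_cmd /dev/sda2 /dev/sdc 32
```

Printing the command first makes it easy to check the key against the output of `sysctl -a | grep flashcache` before applying it as root.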
[Experiment]
Ran dd from the client on top of the RBD volume.
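For anyone reproducing this, a rough sketch of a sequential-write dd run. The exact dd parameters are not given in this mail, so the block size, block count, and target path here are assumptions, and dd_write_test is a hypothetical helper name.

```shell
#!/bin/sh
# Sketch of a sequential-write dd test. The target path, block size and count
# are assumptions; the mail does not give the dd parameters that were used.

dd_write_test() {
    # $1 = file to write, $2 = number of 1 MiB blocks
    # conv=fsync makes dd flush the data to the device before it exits, so
    # the page cache does not inflate the rate that dd reports.
    dd if=/dev/zero of="$1" bs=1048576 count="$2" conv=fsync
}

# Example (hypothetical mount point of the RBD volume on the client):
#   dd_write_test /mnt/rbd/ddtest.bin 4096
```

dd prints the elapsed time and transfer rate on stderr when it finishes.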
[Result]
Throughput was boosted from 37MB/s to ~90MB/s. Since flashcache
works at the DM level, it is transparent to Ceph.
This was just a quick test; further tests (including sequential
R/W and random R/W) are scheduled. I will come back to you when there is some
progress.
Xiaoxi
-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: January 16, 2013 5:43
To: Chen, Xiaoxi
Cc: Mark Nelson; Yan, Zheng
Subject: RE: Seperate metadata disk for OSD
On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
> Hi Sage,
> FlashCache works well for this scenario. I created a hybrid disk with
> 1 SSD partition (on the same SSD as the Ceph journal and XFS journal, but in
> a different partition) and 1 SATA disk. I configured FlashCache to ignore all
> sequential requests larger than 32K (it can be set to a smaller number).
> The results show performance comparable to the CephMeta-to-SSD
> solution.
> Since flashcache works in the DM layer, I suppose it's transparent
> to Ceph, right?
Right. That's great to hear that it works well. If you don't mind, it would
be great if you could report the same thing to ceph-devel with a bit of detail
about how you configured FlashCache so that others can do the same.
Thanks!
sage
>
>
> Xiaoxi
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: January 15, 2013 2:19
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng ; [email protected]
> Subject: RE: Seperate metadata disk for OSD
>
> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
> > Hi Sage,
> > Thanks for your mail~
> > Do you have a timetable for when such an improvement could be
> > ready? It's critical for non-btrfs filesystems.
> > I am thinking about introducing flashcache into my configuration to
> > cache such metadata writes. Since flashcache works underneath the filesystem, I
> > suppose it will not break the assumptions inside Ceph. I will try it
> > tomorrow and come back to you ~
> > Thanks again for the help!
>
> I think avoiding the pginfo change may be pretty simple. The log one I am a
> bit less concerned about (the appends from many rbd IOs will get aggregated
> into a small number of XFS IOs), and changing that around would be a bigger
> deal.
>
> sage
>
>
> >
> >
> > Xiaoxi
> >
> > -----Original Message-----
> > From: Sage Weil [mailto:[email protected]]
> > Sent: January 13, 2013 0:57
> > To: Chen, Xiaoxi
> > Cc: Mark Nelson; Yan, Zheng ; [email protected]
> > Subject: RE: Seperate metadata disk for OSD
> >
> > On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
> > > Hi Zheng,
> > > I have put the XFS log on a separate disk. It does provide some
> > > performance gain, but not a very significant one.
> > > Ceph's metadata is somewhat separate (it is a set of files residing on the OSD's
> > > disk), so it is helped by neither the XFS journal log nor the OSD's
> > > journal. That's why I am trying to put Ceph's metadata (the /data/osd.x/meta
> > > folder) on a separate SSD.
> > > To Nelson,
> > > I did the experiment with just 1 client; with more clients, the
> > > gain will not be that large.
> > > It looks to me that turning a single write from the client side into 3 writes to
> > > disk is a big overhead for an in-place-update filesystem such as
> > > XFS, since it introduces more seeks. An out-of-place-update filesystem does not
> > > suffer much from such a pattern; I didn't see this problem when using
> > > BTRFS as the backend filesystem. But for BTRFS, fragmentation is another
> > > performance killer: for a single RBD volume, if you do a lot of random
> > > writes on it, the sequential read performance drops to 30% of that of a new
> > > RBD volume. This makes BTRFS unusable in production.
> > > Separating the Ceph metadata seems quite easy to me (I just mount a partition at
> > > /data/osd.X/meta). Is that right? Are there any potential problems with it?
> >
> > Unfortunately, yes. The ceph journal and fs sync are carefully timed.
> > The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq
> > file will sync everything, but if meta/ is another fs that isn't true. At
> > the very least, the code needs to be modified to sync that as well.
> >
> > That said, there is a lot of improvement that can be had here. The three
> > things we write are:
> >
> > the pg log
> > the pg info, spread across the pg dir xattr and that pginfo file
> > the actual io
> >
> > The pg log could go in leveldb, which would translate those writes into a
> > single sequential stream across the entire OSD. And the PG info split
> > between the xattr and the file is far from optimal: most of that data
> > doesn't actually change on each write. What little does is very small, and
> > could be moved into the xattr, avoiding touching the file (which means an
> > inode + data block write) at all.
> >
> > We need to look a bit more closely to see how difficult that will really be
> > to implement, but I think it is promising!
> >
> > sage
> >
> >
> > >
> > >
> > >
> > >
> > > Xiaoxi
> > >
> > > -----Original Message-----
> > > From: Mark Nelson [mailto:[email protected]]
> > > Sent: January 12, 2013 21:36
> > > To: Yan, Zheng
> > > Cc: Chen, Xiaoxi; [email protected]
> > > Subject: Re: Seperate metadata disk for OSD
> > >
> > > Hi Xiaoxi and Zheng,
> > >
> > > We've played with both of these some internally, but not for a production
> > > deployment. Mostly just for diagnosing performance problems.
> > > It's been a while since I last played with this, but I hadn't seen a
> > > whole lot of performance improvements at the time. That may have been
> > > due to the hardware in use, or perhaps other parts of Ceph have improved
> > > to the point where this matters now!
>
> > >
> > > On a side note, Btrfs also had a google summer of code project to let you
> > > put metadata on an external device. Originally I think that was supposed
> > > to make it into 3.7, but am not sure if that happened.
> > >
> > > Mark
> > >
> > > On 01/12/2013 06:21 AM, Yan, Zheng wrote:
> > > > On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <[email protected]>
> > > > wrote:
> > > >>
> > > >> Hi list,
> > > > >> For an RBD write request, Ceph needs to do 3 writes:
> > > >> 2013-01-10 13:10:15.539967 7f52f516c700 10
> > > >>filestore(/data/osd.21) _do_transaction on 0x327d790
> > > >> 2013-01-10 13:10:15.539979 7f52f516c700 15
> > > >>filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1
> > > >>36015~147
> > > >> 2013-01-10 13:10:15.540016 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path:
> > > >>/data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
> > > >> 2013-01-10 13:10:15.540164 7f52f516c700 15
> > > >>filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1
> > > >>0~496
> > > >> 2013-01-10 13:10:15.540189 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path:
> > > >>/data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
> > > >> 2013-01-10 13:10:15.540217 7f52f516c700 10
> > > >>filestore(/data/osd.21) _do_transaction on 0x327d708
> > > >> 2013-01-10 13:10:15.540222 7f52f516c700 15
> > > >>filestore(/data/osd.21) write
> > > >>2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2
> > > >>3227648~524288
> > > >> 2013-01-10 13:10:15.540245 7f52f516c700 15
> > > >>filestore(/data/osd.21)
> > > >>path:
> > > >>/data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__
> > > >>he
> > > >>ad
> > > >>_8
> > > >>ABF341A__2
> > > >>l
> > > > >> If XFS is the backend file system and runs on top of a
> > > > >> traditional SATA disk, this introduces a lot of seeks and therefore
> > > > >> reduces bandwidth. A blktrace is available here
> > > > >> (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate
> > > > >> this issue (a single client running dd on top of a new RBD volume).
> > > > >> Then I tried moving /osd.X/current/meta to a separate disk, and
> > > > >> the bandwidth was boosted (see the blktrace at
> > > > >> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
> > > > >> I haven't tested other access patterns or anything else, but it
> > > > >> looks to me that moving such metadata to a separate disk (SSD, or SATA with
> > > > >> btrfs) will benefit Ceph write performance. Is that true? Will Ceph
> > > > >> introduce this feature in the future? Are there any potential problems
> > > > >> with such a hack?
> > > >>
> > > >
> > > > > Did you try putting the XFS metadata log on a separate, fast device
> > > > > (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)? I think it will
> > > > > boost performance too.
> > > >
> > > > Regards
> > > > Yan, Zheng
> > > > --
> > > > To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> > > > in the body of a message to [email protected] More
> > > > majordomo info at http://vger.kernel.org/majordomo-info.html
> > > >
> > >