On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
> Hi Sage,
>      Thanks for your mail~
>      Do you have a timetable for when such an improvement could be
> ready? It's critical for non-btrfs filesystems.
>      I am thinking about introducing flashcache into my configuration to
> cache such meta writes; since flashcache works underneath the filesystem,
> I suppose it will not break the assumptions inside Ceph. I will try it
> tomorrow and come back to you.
>      Thanks again for the help!
I think avoiding the pginfo change may be pretty simple.  The log one I am
a bit less concerned about (the appends from many rbd IOs will get
aggregated into a small number of XFS IOs), and changing that around would
be a bigger deal.

sage

>
>                                                              Xiaoxi
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: January 13, 2013 0:57
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng; [email protected]
> Subject: RE: Seperate metadata disk for OSD
>
> On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
> > Hi Zheng,
> >      I have put the XFS log on a separate disk; it does provide some
> > performance gain, but not a significant one.
> >      Ceph's metadata is somewhat separate (it is a set of files residing
> > on the OSD's disk), so it is helped by neither the XFS log nor the OSD's
> > journal. That's why I am trying to put Ceph's metadata (the
> > /data/osd.x/meta folder) on a separate SSD.
> > To Nelson,
> >      I did the experiment with just 1 client; with more clients the gain
> > will not be as large.
> >      It looks to me that a single write from the client side becoming 3
> > writes to disk is a big overhead for an in-place-update filesystem such
> > as XFS, since it introduces more seeks. An out-of-place-update filesystem
> > does not suffer as much from such a pattern; I didn't see this problem
> > when using btrfs as the backend filesystem. But for btrfs, fragmentation
> > is another performance killer: for a single RBD volume, after a lot of
> > random writes, sequential read performance drops to 30% of that of a
> > fresh RBD volume. This makes btrfs unusable in production.
> >      Separating the Ceph meta directory seems quite easy to me (I just
> > mount a partition at /data/osd.X/meta). Is that right? Is there any
> > potential problem with it?
>
> Unfortunately, yes.  The ceph journal and fs sync are carefully timed.
> The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq
> file will sync everything, but if meta/ is another fs that isn't true.  At
> the very least, the code needs to be modified to sync that as well.
>
> That said, there is a lot of improvement that can be had here.  The three
> things we write are:
>
>   the pg log
>   the pg info, spread across the pg dir xattr and that pginfo file
>   the actual io
>
> The pg log could go in leveldb, which would translate those writes into a
> single sequential stream across the entire OSD.  And the PG info split
> between the xattr and the file is far from optimal: most of that data
> doesn't actually change on each write.  What little does is very small,
> and could be moved into the xattr, avoiding touching the file (which means
> an inode + data block write) at all.
>
> We need to look a bit more closely to see how difficult that will really
> be to implement, but I think it is promising!
>
> sage
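To make the leveldb idea above a bit more concrete, here is a minimal
standalone sketch of the write pattern it implies. This is not Ceph code:
the database path, key names, and payloads are invented for illustration;
the point is only that per-op pg log and pginfo updates collapse into one
sequential leveldb append instead of several small file writes.

    #include <cassert>
    #include <leveldb/db.h>
    #include <leveldb/write_batch.h>

    int main() {
      // Open (or create) a leveldb instance; the path is illustrative only.
      leveldb::DB* db = nullptr;
      leveldb::Options options;
      options.create_if_missing = true;
      leveldb::Status s =
          leveldb::DB::Open(options, "/data/osd.21/current/kvstore", &db);
      assert(s.ok());

      // Collect the small per-op metadata updates for one transaction into a
      // single batch instead of writing one tiny file per PG per op.
      leveldb::WriteBatch batch;
      batch.Put("pglog.2.1a.0000036015", "serialized pg log entry (placeholder)");
      batch.Put("pginfo.2.1a",           "small mutable part of pg info (placeholder)");

      // One Write() call becomes one sequential append to leveldb's own log,
      // regardless of how many PGs were touched.
      leveldb::WriteOptions wo;
      wo.sync = false;  // durability still comes from the OSD journal/sync cycle
      s = db->Write(wo, &batch);
      assert(s.ok());

      delete db;
      return 0;
    }

However many PGs an OSD transaction touches, the metadata then lands in one
sequential append rather than scattered inode and data-block updates.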
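A similarly rough sketch of the "keep the small changing part of pginfo in
an xattr" idea. Again, this is not Ceph code: the xattr name and the text
payload are made up, and real pg info is a binary-encoded structure; the
sketch only shows that updating an xattr on the PG directory avoids
rewriting the pginfo file (and its inode + data block) on every op.

    #include <sys/xattr.h>
    #include <cstdio>
    #include <cstring>

    int main() {
      // Hypothetical: only the few bytes that change per write go into an
      // xattr on the PG directory, so the pginfo file itself is untouched.
      const char* pg_dir = "/data/osd.21/current/2.1a_head";
      const char* value  = "epoch=123 last_update=36015";  // placeholder payload

      if (setxattr(pg_dir, "user.cephtest.pginfo_mutable",
                   value, strlen(value), 0) != 0) {
        perror("setxattr");
        return 1;
      }
      return 0;
    }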
>
>
> >
> >                                                              Xiaoxi
> > -----Original Message-----
> > From: Mark Nelson [mailto:[email protected]]
> > Sent: January 12, 2013 21:36
> > To: Yan, Zheng
> > Cc: Chen, Xiaoxi; [email protected]
> > Subject: Re: Seperate metadata disk for OSD
> >
> > Hi Xiaoxi and Zheng,
> >
> > We've played with both of these some internally, but not for a
> > production deployment.  Mostly just for diagnosing performance problems.
> > It's been a while since I last played with this, but I hadn't seen a
> > whole lot of performance improvement at the time.  That may have been
> > due to the hardware in use, or perhaps other parts of Ceph have improved
> > to the point where this matters now!
> >
> > On a side note, btrfs also had a Google Summer of Code project to let
> > you put metadata on an external device.  Originally I think that was
> > supposed to make it into 3.7, but I am not sure whether that happened.
> >
> > Mark
> >
> > On 01/12/2013 06:21 AM, Yan, Zheng wrote:
> > > On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <[email protected]> wrote:
> > >>
> > >> Hi list,
> > >>          For an rbd write request, Ceph needs to do 3 writes:
> > >> 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790
> > >> 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
> > >> 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
> > >> 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
> > >> 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
> > >> 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d708
> > >> 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288
> > >> 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__head_8ABF341A__2
> > >>          If using XFS as the backend filesystem on top of a
> > >> traditional SATA disk, this introduces a lot of seeks and therefore
> > >> reduces bandwidth; a blktrace is available here
> > >> (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to
> > >> demonstrate the issue (a single client running dd on top of a new RBD
> > >> volume).
> > >>          Then I tried to move /osd.X/current/meta to a separate disk,
> > >> and the bandwidth increased markedly (see the blktrace at
> > >> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
> > >>          I haven't tested other access patterns or anything else, but
> > >> it looks to me that moving this meta to a separate disk (SSD, or SATA
> > >> with btrfs) will benefit Ceph write performance. Is that true? Will
> > >> Ceph introduce this feature in the future? Is there any potential
> > >> problem with such a hack?
> > >>
> > >
> > > Did you try putting the XFS metadata log on a separate, fast device
> > > (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)?  I think it will boost
> > > performance too.
> > >
> > > Regards
> > > Yan, Zheng
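For reference, a minimal standalone sketch of the caveat Sage raises above
about a separately mounted meta/ directory: syncfs(2) only flushes the
filesystem that contains the given descriptor, so a second mount would need
its own descriptor and its own syncfs call. This is not the actual ceph-osd
code; the paths are simply the examples from this thread.

    #include <fcntl.h>
    #include <unistd.h>
    #include <cstdio>

    int main() {
      // syncfs(2) flushes only the filesystem containing the given fd.
      int data_fd = open("/data/osd.21/current/commit_op_seq", O_RDONLY);
      // If meta/ is a separate mount, it needs its own descriptor...
      int meta_fd = open("/data/osd.21/current/meta", O_RDONLY | O_DIRECTORY);
      if (data_fd < 0 || meta_fd < 0) {
        perror("open");
        return 1;
      }
      // ...and its own syncfs call; one call no longer covers both filesystems.
      if (syncfs(data_fd) < 0 || syncfs(meta_fd) < 0) {
        perror("syncfs");
        return 1;
      }
      close(data_fd);
      close(meta_fd);
      return 0;
    }

A block-level cache such as flashcache, sitting below a single filesystem as
Xiaoxi suggests, should not run into this particular issue, since there is
still only one filesystem to sync, though writeback caching brings its own
durability considerations.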
