Hi Xiaoxi,

That's fantastic! This is really exciting! Congrats on getting this working. Once I have free time again I will try to replicate your findings. Again, great job!
Mark

On 01/15/2013 10:00 PM, Chen, Xiaoxi wrote:
> Hi List,
>      I have introduced flashcache (https://github.com/facebook/flashcache),
> aiming to reduce Ceph metadata I/O to the OSD's disk. Basically, for every
> data write, Ceph needs to write 3 things:
>      Pg log
>      Pg info
>      Actual data
> The first 2 writes are small, but on a non-btrfs filesystem they cause the
> OSD disk to do 2 seeks, which is critical to a spinning disk's throughput,
> as mentioned in an earlier mail.
>
> I list the details of my experiment below; any input is highly appreciated.
>
> [Setup]
>      2 hosts, 1 SSD with 1 SATA disk; the SSD is partitioned into 4 partitions.
>      P1 as OSD journal, P2 as FlashCache for the SATA disk, P3 as XFS metadata journal.
>      1 client, 1 RBD volume created and mounted.
> [FlashCache setup]
>      [Create cached device]
>          flashcache_create -v -p back fsdc /dev/sda2 /dev/sdc
>      [Create filesystem]
>          mkfs.xfs -f -i size=2048 -d agcount=1 -l logdev=/dev/sda3,size=128m /dev/mapper/fsdc
>      [Mount]
>          mount -o logdev=/dev/sda3 -o logbsize=256k -o delaylog -o inode64 /dev/mapper/fsdc /data/osd.21/
>      [Tuning]
>          sysctl dev.flashcache.sda2+sdc.skip_seq_thresh_kb=32
>
>          Since I am aiming to cache only Ceph metadata, and the metadata
>          writes are very small, I configured flashcache to skip all
>          sequential writes larger than 32K. You could set this to 1K, because
>          the meta writes are all less than 1K; I set it to 32K just for a
>          quick test.
> [Experiment]
>      Doing dd from the client on top of the RBD volume.
> [Result]
>      Throughput rose from 37MB/s to ~90MB/s. Since flashcache works at the
>      DM level, it is transparent to Ceph.
>
> My test is just a quick test; further tests (including sequential R/W and
> random R/W) are scheduled. I will come back to you if there is some
> progress.
>
>
> Xiaoxi
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: January 16, 2013 5:43
> To: Chen, Xiaoxi
> Cc: Mark Nelson; Yan, Zheng
> Subject: RE: Seperate metadata disk for OSD
>
> On Tue, 15 Jan 2013, Chen, Xiaoxi wrote:
>> Hi Sage,
>>      FlashCache works well for this scenario. I created a hybrid disk with
>> 1 SSD partition (on the same SSD as the Ceph journal and XFS journal, but a
>> different partition) and 1 SATA disk, and configured FlashCache to ignore
>> all sequential requests larger than 32K (well, it can be set to a smaller
>> number).
>>      The results show performance comparable to the CephMeta-to-SSD
>> solution.
>>      Since flashcache works in the DM layer, I suppose it's transparent
>> to Ceph, right?
>
> Right.  That's great to hear that it works well.  If you don't mind, it would
> be great if you could report the same thing to ceph-devel with a bit of
> detail about how you configured FlashCache so that others can do the same.
>
> Thanks!
> sage
>
>
>>
>>
>> Xiaoxi
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:[email protected]]
>> Sent: January 15, 2013 2:19
>> To: Chen, Xiaoxi
>> Cc: Mark Nelson; Yan, Zheng; [email protected]
>> Subject: RE: Seperate metadata disk for OSD
>>
>> On Mon, 14 Jan 2013, Chen, Xiaoxi wrote:
>>> Hi Sage,
>>>      Thanks for your mail~
>>>      Do you have a timetable for when such an improvement could be
>>> ready? It's critical for non-btrfs filesystems.
>>>      I am thinking about introducing flashcache into my configuration to
>>> cache such meta writes. Since flashcache works under the filesystem, I
>>> suppose it will not break the assumptions inside Ceph. I will try it
>>> tomorrow and come back to you~
>>>      Thanks again for the help!
>>
>> I think avoiding the pginfo change may be pretty simple.  The log one I am a
>> bit less concerned about (the appends from many rbd IOs will get aggregated
>> into a small number of XFS IOs), and changing that around would be a bigger
>> deal.
>>
>> sage
>>
>>
>>>
>>>
>>> Xiaoxi
>>>
>>> -----Original Message-----
>>> From: Sage Weil [mailto:[email protected]]
>>> Sent: January 13, 2013 0:57
>>> To: Chen, Xiaoxi
>>> Cc: Mark Nelson; Yan, Zheng; [email protected]
>>> Subject: RE: Seperate metadata disk for OSD
>>>
>>> On Sat, 12 Jan 2013, Chen, Xiaoxi wrote:
>>>> Hi Zheng,
>>>>      I have put the XFS log on a separate disk. It does provide some
>>>> performance gain, but not a significant one.
>>>>      Ceph's metadata is somewhat separate (it is a set of files residing
>>>> on the OSD's disk), so it is helped by neither the XFS journal log nor
>>>> the OSD's journal. That's why I am trying to put Ceph's metadata (the
>>>> /data/osd.x/meta folder) on a separate SSD.
>>>> To Nelson,
>>>>      I did the experiment with just 1 client; with more clients, the
>>>> gain will not be that large.
>>>>      It looks to me that turning a single client write into 3 writes to
>>>> disk is a big overhead for an in-place-update filesystem such as XFS,
>>>> since it introduces more seeks. An out-of-place-update filesystem does
>>>> not suffer much from this pattern; I didn't find this problem when using
>>>> BTRFS as the backend filesystem. But for BTRFS, fragmentation is another
>>>> performance killer: for a single RBD volume, if you do a lot of random
>>>> writes on it, the sequential read performance drops to 30% of that of a
>>>> new RBD volume. This makes BTRFS unusable in production.
>>>>      Separating the Ceph meta seems quite easy to me (I just mount a
>>>> partition at /data/osd.X/meta). Is that right? Is there any potential
>>>> problem with it?
>>>
>>> Unfortunately, yes.  The ceph journal and fs sync are carefully timed.
>>> The ceph-osd assumes that syncfs(2) on the $osd_data/current/commit_op_seq
>>> file will sync everything, but if meta/ is another fs that isn't true.  At
>>> the very least, the code needs to be modified to sync that as well.
>>>
>>> That said, there is a lot of improvement that can be had here.
>>> The three things we write are:
>>>
>>>      the pg log
>>>      the pg info, spread across the pg dir xattr and that pginfo file
>>>      the actual io
>>>
>>> The pg log could go in leveldb, which would translate those writes into a
>>> single sequential stream across the entire OSD.  And the PG info split
>>> between the xattr and the file is far from optimal: most of that data
>>> doesn't actually change on each write.  What little does is very small, and
>>> could be moved into the xattr, avoiding touching the file (which means an
>>> inode + data block write) at all.
>>>
>>> We need to look a bit more closely to see how difficult that will really be
>>> to implement, but I think it is promising!
>>>
>>> sage
>>>
>>>
>>>>
>>>>
>>>> Xiaoxi
>>>>
>>>> -----Original Message-----
>>>> From: Mark Nelson [mailto:[email protected]]
>>>> Sent: January 12, 2013 21:36
>>>> To: Yan, Zheng
>>>> Cc: Chen, Xiaoxi; [email protected]
>>>> Subject: Re: Seperate metadata disk for OSD
>>>>
>>>> Hi Xiaoxi and Zheng,
>>>>
>>>> We've played with both of these some internally, but not for a production
>>>> deployment.  Mostly just for diagnosing performance problems.
>>>> It's been a while since I last played with this, but I hadn't seen a
>>>> whole lot of performance improvements at the time.  That may have been due
>>>> to the hardware in use, or perhaps other parts of Ceph have improved to
>>>> the point where this matters now!
>>>>
>>>> On a side note, Btrfs also had a Google Summer of Code project to let you
>>>> put metadata on an external device.  Originally I think that was supposed
>>>> to make it into 3.7, but am not sure if that happened.
>>>>
>>>> Mark
>>>>
>>>> On 01/12/2013 06:21 AM, Yan, Zheng wrote:
>>>>> On Sat, Jan 12, 2013 at 2:57 PM, Chen, Xiaoxi <[email protected]>
>>>>> wrote:
>>>>>>
>>>>>> Hi list,
>>>>>>      For an rbd write request, Ceph needs to do 3 writes:
>>>>>> 2013-01-10 13:10:15.539967 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d790
>>>>>> 2013-01-10 13:10:15.539979 7f52f516c700 15 filestore(/data/osd.21) write meta/516b801c/pglog_2.1a/0//-1 36015~147
>>>>>> 2013-01-10 13:10:15.540016 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_C/pglog\u2.1a__0_516B801C__none
>>>>>> 2013-01-10 13:10:15.540164 7f52f516c700 15 filestore(/data/osd.21) write meta/28d2f4a8/pginfo_2.1a/0//-1 0~496
>>>>>> 2013-01-10 13:10:15.540189 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/meta/DIR_8/pginfo\u2.1a__0_28D2F4A8__none
>>>>>> 2013-01-10 13:10:15.540217 7f52f516c700 10 filestore(/data/osd.21) _do_transaction on 0x327d708
>>>>>> 2013-01-10 13:10:15.540222 7f52f516c700 15 filestore(/data/osd.21) write 2.1a_head/8abf341a/rb.0.106e.6b8b4567.0000000002d3/head//2 3227648~524288
>>>>>> 2013-01-10 13:10:15.540245 7f52f516c700 15 filestore(/data/osd.21) path: /data/osd.21/current/2.1a_head/rb.0.106e.6b8b4567.0000000002d3__head_8ABF341A__2
>>>>>>      When XFS is the backend filesystem and runs on top of a
>>>>>> traditional SATA disk, this introduces a lot of seeks and therefore
>>>>>> reduces bandwidth. A blktrace is available here
>>>>>> (http://ww3.sinaimg.cn/mw690/6e1aee47jw1e0qsbxbvddj.jpg) to demonstrate
>>>>>> this issue (a single client running dd on top of a new RBD volume).
>>>>>>      Then I tried moving /osd.X/current/meta to a separate disk, and
>>>>>> the bandwidth was boosted (see the blktrace at
>>>>>> http://ww4.sinaimg.cn/mw690/6e1aee47jw1e0qsadz1bij.jpg).
>>>>>>      I haven't tested other access patterns or anything else, but it
>>>>>> looks to me that moving such meta to a separate disk (SSD, or SATA with
>>>>>> btrfs) will benefit Ceph write performance. Is that true? Will Ceph
>>>>>> introduce this feature in the future? Is there any potential problem
>>>>>> with such a hack?
>>>>>>
>>>>>
>>>>> Did you try putting the XFS metadata log on a separate, fast device
>>>>> (mkfs.xfs -l logdev=/dev/sdbx,size=10000b)? I think it will
>>>>> boost performance too.
>>>>>
>>>>> Regards
>>>>> Yan, Zheng
>>>>> --
>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>>>> in the body of a message to [email protected]
>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
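
As a recap of the FlashCache setup quoted in the thread, the four commands can be collected into one small script. This is only a sketch: the device names, cache name, and mount point are the ones from the mail, the run and setup_flashcache_osd helpers are hypothetical names introduced here, and the sysctl key is built from the two device names on the assumption that FlashCache names it <ssd>+<disk>. By default it only prints the commands, because running them needs root and reformats the target device.

```shell
#!/bin/sh
# Sketch of the FlashCache-backed OSD setup described in the thread.
# All values below are taken from the mail; adjust them for your own host.
SSD_CACHE=/dev/sda2       # SSD partition used as the FlashCache cache
SSD_XFSLOG=/dev/sda3      # SSD partition used as the external XFS log
SATA_DISK=/dev/sdc        # backing SATA disk
CACHE_NAME=fsdc           # device-mapper name, i.e. /dev/mapper/fsdc
MOUNT_POINT=/data/osd.21
SKIP_SEQ_KB=32            # sequential I/O above this size bypasses the cache

# Print the command unless DRY_RUN=0 is set, in which case execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

setup_flashcache_osd() {
    # 1. Create the write-back cached device.
    run flashcache_create -v -p back "$CACHE_NAME" "$SSD_CACHE" "$SATA_DISK"
    # 2. Create XFS on the cached device with its log on the SSD.
    run mkfs.xfs -f -i size=2048 -d agcount=1 \
        -l "logdev=$SSD_XFSLOG,size=128m" "/dev/mapper/$CACHE_NAME"
    # 3. Mount it as the OSD data directory.
    run mount -o "logdev=$SSD_XFSLOG" -o logbsize=256k -o delaylog \
        -o inode64 "/dev/mapper/$CACHE_NAME" "$MOUNT_POINT"
    # 4. Skip caching of sequential I/O larger than SKIP_SEQ_KB.
    #    Assumption: the sysctl key is named <ssd>+<disk>.
    run sysctl "dev.flashcache.${SSD_CACHE##*/}+${SATA_DISK##*/}.skip_seq_thresh_kb=$SKIP_SEQ_KB"
}

setup_flashcache_osd
```

Run as-is, it prints the plan for review; setting DRY_RUN=0 would execute the same commands.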

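
Sage's caveat about mounting a separate partition at meta/ comes down to whether meta/ and current/ share a filesystem, because a single syncfs(2) call flushes only the filesystem that contains the file it is given. A rough way to check a given OSD data directory is to compare device numbers. The sketch below assumes GNU stat; same_fs and check_osd_meta are hypothetical helper names, not part of Ceph.

```shell
#!/bin/sh
# Return 0 if both paths are on the same filesystem, 1 if not,
# 2 if either path cannot be examined. Compares the device numbers
# reported by GNU stat (format %d).
same_fs() {
    dev_a=$(stat -c %d "$1") || return 2
    dev_b=$(stat -c %d "$2") || return 2
    [ "$dev_a" = "$dev_b" ]
}

# Warn when an OSD's meta/ directory sits on a different filesystem
# from current/, the case described in the thread as unsafe.
check_osd_meta() {
    osd_data=$1
    if same_fs "$osd_data/current" "$osd_data/current/meta"; then
        echo "ok: meta shares a filesystem with current"
    else
        rc=$?
        if [ "$rc" -eq 1 ]; then
            echo "warning: meta is on a separate filesystem; syncfs on current will not flush it"
        else
            echo "error: cannot examine $osd_data" >&2
        fi
        return "$rc"
    fi
}
```

For the OSD in the thread this would be called as check_osd_meta /data/osd.21. A FlashCache device sits below the filesystem, so it does not change the result of this check.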