>> FWIW, the journal will coalesce writes quickly when there are many
>> concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the
>> journal will start coalescing. For say 100-150 IOPS (what a spinning
>> disk can handle), expect around 9ish 100KB journal writes (with padding
>> and header/footer for each client IO). What we've seen is that some
>> drives that aren't that great at 4K O_DSYNC writes are still reasonably
>> good with 8+ concurrent larger O_DSYNC writes.
Yes, indeed, it's not that bad (hopefully ;). When benching the Crucial M550, I only see, from time to time (maybe every 30s, I don't remember exactly), IOs slowing down to ~200 IOPS for 1 or 2 seconds, then going back up to the normal speed of around 4000 IOPS. With the Intel S3500, I get constant writes at 5000 IOPS.

BTW, does the rbd client cache also help coalesce writes (client side), and so also help the journal?

----- Original Message -----
From: "Mark Nelson" <[email protected]>
To: "Alexandre DERUMIER" <[email protected]>, "Xiaoxi Chen" <[email protected]>
Cc: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, [email protected]
Sent: Wednesday, 17 September 2014 17:01:06
Subject: Re: puzzled with the design pattern of ceph journal, really ruining performance

On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote:
>>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the
>>> performance of the journal? In our previous test, the journal SSD (using a
>>> partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing
>>> the same SSD) could reach its peak performance (300-400MB/s)
>
> Hi,
>
> I have done some benchmarks here:
>
> http://www.mail-archive.com/[email protected]/msg12950.html
>
> Some SSD models have really bad performance with O_DSYNC (Crucial M550:
> 312 IOPS on 4k blocks).
> Benching 1 OSD, I can see big latencies for some seconds when O_DSYNC writes occur.

FWIW, the journal will coalesce writes quickly when there are many
concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the
journal will start coalescing. For say 100-150 IOPS (what a spinning
disk can handle), expect around 9ish 100KB journal writes (with padding
and header/footer for each client IO). What we've seen is that some
drives that aren't that great at 4K O_DSYNC writes are still reasonably
good with 8+ concurrent larger O_DSYNC writes.
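[Editor's note: the coalescing Mark describes can be sketched roughly as below. This is a minimal illustration, not FileStore's actual journal format — the header layout, footer size, and 4k padding rule here are all assumptions made for the example.]

```python
import os
import struct
import tempfile

BLOCK = 4096                  # pad each journal write to block alignment (assumed)
HEADER = struct.Struct("QQ")  # per-entry header: seq + length (illustrative)
FOOTER = 8                    # per-entry footer, e.g. a checksum slot (illustrative)

def coalesce(entries):
    """Pack several pending client IOs into one padded journal buffer."""
    buf = bytearray()
    for seq, data in enumerate(entries):
        buf += HEADER.pack(seq, len(data))   # header
        buf += data                          # client IO payload
        buf += b"\0" * FOOTER                # footer placeholder
    buf += b"\0" * ((-len(buf)) % BLOCK)     # pad up to the next block boundary
    return bytes(buf)

def journal_write(fd, entries):
    """One write and one flush for the whole batch, not one per client IO."""
    buf = coalesce(entries)
    os.write(fd, buf)
    getattr(os, "fdatasync", os.fsync)(fd)   # fdatasync where available
    return len(buf)

# Eight 4k client IOs coalesce into a single 36864-byte (nine-block) journal write:
with tempfile.NamedTemporaryFile() as f:
    n = journal_write(f.fileno(), [b"x" * 4096] * 8)
```

The point of the batching is that the expensive flush is paid once per batch instead of once per client IO, which is why drives that are weak at single 4k O_DSYNC writes can still do well under concurrency.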
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
>       --group_reporting --invalidate=0 --name=ab --sync=1
> bw=1249.9KB/s, iops=312
>
> intel s3500
> -----------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
>       --group_reporting --invalidate=0 --name=ab --sync=1
> bw=41794KB/s, iops=10448
>
> ----- Original Message -----
>
> From: "Xiaoxi Chen" <[email protected]>
> To: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, [email protected]
> Sent: Wednesday, 17 September 2014 09:59:37
> Subject: RE: puzzled with the design pattern of ceph journal, really ruining performance
>
> Hi Nicheal,
>
> 1. The main purpose of the journal is to provide transaction semantics (preventing
> partial updates). Peering is not enough for this, because Ceph writes all replicas
> at the same time, so after a crash you have no idea which replica has the right
> data. For example, with 2 replicas, if a user updates a 4MB object and the primary
> OSD crashes when the first 2MB has been written, the secondary OSD may also fail
> when the first 3MB has been written. Both versions, on primary and secondary, are
> then neither the new value nor the old value, and there is no way to recover. So,
> sharing the same idea as a database, we need a journal to support transactions and
> prevent this from happening. For backends that do support transactions, BTRFS for
> instance, we don't need a journal for correctness; we can write to the journal and
> the data disk at the same time. The journal there is just trying to help
> performance, since it only does sequential writes, which we expect to be faster
> than the backend OSD disk.
>
> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the performance
> of the journal?
> In our previous test, the journal SSD (using a partition of an SSD as the journal
> for a particular OSD, with 4 OSDs sharing the same SSD) could reach its peak
> performance (300-400MB/s).
>
> Xiaoxi
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of Somnath Roy
> Sent: Wednesday, September 17, 2014 3:30 PM
> To: 姚宁; [email protected]
> Subject: RE: puzzled with the design pattern of ceph journal, really ruining performance
>
> Hi Nicheal,
> Not only recovery; IMHO the main purpose of the Ceph journal is to support
> transaction semantics, since XFS doesn't have that. I guess it can't be achieved
> with pg_log/pg_info.
>
> Thanks & Regards
> Somnath
>
> -----Original Message-----
> From: [email protected] [mailto:[email protected]] On Behalf Of 姚宁
> Sent: Tuesday, September 16, 2014 11:29 PM
> To: [email protected]
> Subject: puzzled with the design pattern of ceph journal, really ruining performance
>
> Hi, guys
>
> I have been analyzing the architecture of the Ceph source code.
>
> I know that, in order to keep the journal atomic and consistent, journal writes
> must be done with O_DSYNC, or fdatasync() must be called after every write
> operation. However, this kind of operation really kills performance and causes
> high commit latency, even when an SSD is used as the journal disk. If the SSD has
> capacitors to keep data safe when the system crashes, we can set the mount option
> nobarrier, or the SSD itself will ignore the FLUSH request, so performance is
> better.
>
> So can the journal be replaced by other strategies?
> As far as I am concerned, the most important parts are pg_log and pg_info. They
> guide a crashed OSD to recover its objects from its peers. Therefore, if we can
> keep pg_log at a consistent point, we can recover data without the journal. So
> can we just use an "undo" strategy on pg_log and drop the Ceph journal?
> It would save lots of bandwidth, and, based on a consistent pg_log epoch, we
> could always recover data from a peering OSD, right? But it would mean recovering
> more objects if an OSD crashes.
>
> Nicheal
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to [email protected]
More majordomo info at http://vger.kernel.org/majordomo-info.html
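[Editor's note: the sync-write cost Nicheal objects to is easy to observe directly. The sketch below times small writes with and without a flush after each one; it uses fdatasync/fsync rather than O_DSYNC so it runs on any Unix, and the counts and sizes are arbitrary. fio (as in the benches above) remains the right tool for device-level numbers.]

```python
import os
import tempfile
import time

# Python exposes os.fdatasync only on platforms that have it (e.g. Linux);
# fall back to fsync elsewhere.
sync = getattr(os, "fdatasync", os.fsync)

def timed_writes(n, flush):
    """Write n 4k blocks; optionally flush each to the device; return seconds."""
    block = b"x" * 4096
    with tempfile.NamedTemporaryFile() as f:
        start = time.perf_counter()
        for _ in range(n):
            f.write(block)
            if flush:
                f.flush()          # push Python's userspace buffer to the kernel...
                sync(f.fileno())   # ...then force the kernel to stable storage
        return time.perf_counter() - start

buffered = timed_writes(200, flush=False)
flushed = timed_writes(200, flush=True)
# `flushed` is typically orders of magnitude slower than `buffered`; that gap
# is what a capacitor-backed SSD (which may safely ignore flushes) largely hides.
```

This also shows why per-write durability cannot simply be dropped: without the flush, the "fast" path gives no ordering or completeness guarantee after a crash, which is exactly what the journal's transaction semantics rely on.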
