>>FWIW, the journal will coalesce writes quickly when there are many 
>>concurrent 4k client writes.  Once you hit around 8 4k IOs per OSD, the 
>>journal will start coalescing.  For say 100-150 IOPs (what a spinning 
>>disk can handle), expect around 9ish 100KB journal writes (with padding 
>>and header/footer for each client IO).  What we've seen is that some 
>>drives that aren't that great at 4K O_DSYNC writes are still reasonably 
>>good with 8+ concurrent larger O_DSYNC writes.

Yes, indeed, it's not that bad. (hopefully ;)

When benching the Crucial M550, I only see, from time to time (maybe every 30s, 
I don't remember exactly), IOs slowing down to 200 for 1 or 2 seconds, 
then going back up to normal speed, around 4000 IOPS.

With the Intel S3500, I get constant writes at 5000 IOPS.


BTW, does the RBD client cache also help with coalescing writes (client side), and 
therefore also help the journal?
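(For anyone looking this up later: the client-side cache in question is configured in the `[client]` section of ceph.conf. A minimal example follows; the option names are the standard RBD cache settings, and the byte values shown are the commonly cited defaults of that era, so check them against your Ceph version.)

```ini
[client]
# Enable the client-side RBD cache (writeback once a flush has been seen).
rbd cache = true
# Total cache size in bytes (32 MB).
rbd cache size = 33554432
# Max bytes of dirty data before writes block on writeback (24 MB).
rbd cache max dirty = 25165824
# Start flushing dirty data at this threshold (16 MB).
rbd cache target dirty = 16777216
# Behave as writethrough until the guest issues its first flush.
rbd cache writethrough until flush = true
```

In writeback mode the cache can merge adjacent small writes before they ever reach the OSDs, which reduces the op count the journal sees.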





----- Original Message ----- 

From: "Mark Nelson" <[email protected]> 
To: "Alexandre DERUMIER" <[email protected]>, "Xiaoxi Chen" 
<[email protected]> 
Cc: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, 
[email protected] 
Sent: Wednesday, September 17, 2014 17:01:06 
Subject: Re: puzzled with the design pattern of ceph journal, really ruining 
performance 

On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote: 
>>> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the 
>>> performance of the journal? In our previous test, the journal SSD (using a 
>>> partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing 
>>> the same SSD) could reach its peak performance (300-400MB/s). 
> 
> Hi, 
> 
> I have done some benchmarks here: 
> 
> http://www.mail-archive.com/[email protected]/msg12950.html 
> 
> Some SSD models have really bad performance with O_DSYNC (Crucial M550: 312 
> IOPS on 4k blocks). 
> Benching 1 OSD, I can see big latencies for some seconds when O_DSYNC writes 
> occur. 

FWIW, the journal will coalesce writes quickly when there are many 
concurrent 4k client writes. Once you hit around 8 4k IOs per OSD, the 
journal will start coalescing. For say 100-150 IOPs (what a spinning 
disk can handle), expect around 9ish 100KB journal writes (with padding 
and header/footer for each client IO). What we've seen is that some 
drives that aren't that great at 4K O_DSYNC writes are still reasonably 
good with 8+ concurrent larger O_DSYNC writes. 
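(As a rough illustration of the arithmetic above; this is not Ceph code, and the per-IO overhead and batch size are assumptions picked so the totals line up with the ~100KB / ~9 writes-per-second figures quoted.)

```python
# Back-of-envelope sketch of journal write coalescing (hypothetical numbers).
# Assumes each 4KB client IO carries roughly 2KB of journal header/footer/
# padding, and that ~16 concurrent IOs get coalesced into one journal write.

client_iops = 150      # roughly what a spinning data disk can absorb
client_io_kb = 4       # client write size
overhead_kb = 2        # assumed per-IO journal header/footer/padding
batch = 16             # assumed IOs coalesced per journal write at depth 8+

journal_write_kb = batch * (client_io_kb + overhead_kb)   # ~96 KB per write
journal_writes_per_sec = client_iops / batch              # ~9.4 writes/sec

print(journal_write_kb, round(journal_writes_per_sec, 1))
```

So the journal device sees a handful of ~100KB sequential O_DSYNC writes per second rather than 150 tiny ones, which is why drives that are weak at 4K O_DSYNC can still keep up.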

> 
> 
> 
> crucial m550 
> ------------ 
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
> --group_reporting --invalidate=0 --name=ab --sync=1 
> bw=1249.9KB/s, iops=312 
> 
> intel s3500 
> ----------- 
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
> --group_reporting --invalidate=0 --name=ab --sync=1 
> bw=41794KB/s, iops=10448 
> 
> ----- Original Message ----- 
> 
> From: "Xiaoxi Chen" <[email protected]> 
> To: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, 
> [email protected] 
> Sent: Wednesday, September 17, 2014 09:59:37 
> Subject: RE: puzzled with the design pattern of ceph journal, really ruining 
> performance 
> 
> Hi Nicheal, 
> 
> 1. The main purpose of the journal is to provide transaction semantics 
> (prevent partial updates). Peering is not enough for this, because Ceph 
> writes all replicas at the same time, so after a crash you have no idea 
> which replica has the right data. For example, say we have 2 replicas and a 
> user updates a 4MB object: the primary OSD may crash after the first 2MB has 
> been written, and the secondary OSD may also fail after the first 3MB has 
> been written. Both versions on the primary/secondary are then neither the 
> new value nor the old value, and there is no way to recover. So, sharing the 
> same idea as databases, we need a journal to support transactions and 
> prevent this from happening. For backends that do support transactions, 
> BTRFS for instance, we don't need the journal for correctness; we can write 
> the journal and the data disk at the same time, and the journal there just 
> tries to help performance, since it only does sequential writes, which we 
> expect to be faster than the backend OSD disk. 
> 
> 2. Have you got any data to prove that O_DSYNC or fdatasync kills the 
> performance of the journal? In our previous test, the journal SSD (using a 
> partition of an SSD as the journal for a particular OSD, with 4 OSDs sharing 
> the same SSD) could reach its peak performance (300-400MB/s). 
> 
> Xiaoxi 
> 
> 
> 
> -----Original Message----- 
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Somnath Roy 
> Sent: Wednesday, September 17, 2014 3:30 PM 
> To: 姚宁; [email protected] 
> Subject: RE: puzzled with the design pattern of ceph journal, really ruining 
> performance 
> 
> Hi Nicheal, 
> Not only recovery; IMHO the main purpose of the Ceph journal is to support 
> transaction semantics, since XFS doesn't have that. I guess it can't be 
> achieved with pg_log/pg_info. 
> 
> Thanks & Regards 
> Somnath 
> 
> -----Original Message----- 
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of ?? 
> Sent: Tuesday, September 16, 2014 11:29 PM 
> To: [email protected] 
> Subject: puzzled with the design pattern of ceph journal, really ruining 
> performance 
> 
> Hi, guys 
> 
> I have been analyzing the architecture of the Ceph source code. 
> 
> I know that, in order to keep the journal atomic and consistent, the journal 
> must be opened with O_DSYNC, or the fdatasync() system call must be made after 
> every write operation. However, this kind of operation really kills 
> performance and causes high commit latency, even if an SSD is used as the 
> journal disk. If the SSD has a capacitor to keep the data safe when the 
> system crashes, we can set the mount option nobarrier, or the SSD itself will 
> ignore the FLUSH requests, so performance would be better. 
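(The per-write flush cost being described can be measured directly. A minimal sketch follows; the path and iteration count are placeholders, it assumes a Linux host where `os.O_DSYNC` is available, and fio as used elsewhere in this thread remains the better tool for real numbers.)

```python
import os
import time

def dsync_write_iops(path, count=200, block=b"\0" * 4096):
    """Measure 4KB O_DSYNC write IOPS against `path` (overwrites the file!)."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o644)
    try:
        start = time.perf_counter()
        for _ in range(count):
            # Rewrite the same 4KB block; O_DSYNC forces each write to be
            # durable before the call returns, just like the journal's writes.
            os.pwrite(fd, block, 0)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed
```

On a drive without power-loss protection each of these writes waits on a real cache flush, which is why the M550 and S3500 numbers in this thread differ by more than an order of magnitude.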
> 
> So can it be replaced by other strategies? 
> As far as I am concerned, the most important parts are pg_log and pg_info: 
> they guide the crashed OSD in recovering its objects from its peers. 
> Therefore, if we can keep pg_log at a consistent point, we can recover data 
> without the journal. So can we just use an "undo" 
> strategy on pg_log and drop the Ceph journal? It would save lots of 
> bandwidth, and, based on a consistent pg_log epoch, we could always recover 
> data from the peering OSDs, right? But this would mean recovering more 
> objects if the OSD crashes. 
> 
> Nicheal 
> 
