>When benchmarking the Crucial M550, I only see, from time to time (maybe 
>every 30s, I don't remember exactly), IOPS dropping to 200 for 1 or 2 
>seconds, then going back up to normal speed, around 4000 IOPS

Wow, that indicates the M550 is busy with garbage collection. Maybe just try to 
over-provision a bit (say, if you have a 400G SSD, only partition ~300G); 
over-provisioning an SSD generally helps both performance and durability. 
Actually, if you look at the spec difference between the Intel S3500 and S3700, 
the root cause is a different over-provisioning ratio :)
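The effect of leaving part of the drive unpartitioned is easy to estimate. A 
minimal sketch, using the example capacities from above (not measured values 
for any particular drive):

```python
# Rough sketch of the effective over-provisioning (OP) you get by leaving
# part of an SSD unpartitioned. Capacities are the example numbers from the
# mail above, not measured values for any specific drive.

def op_ratio(raw_gb, usable_gb):
    """Spare area as a fraction of the capacity the OS actually uses."""
    return (raw_gb - usable_gb) / usable_gb

# A 400G SSD partitioned down to ~300G, as suggested above:
print(f"{op_ratio(400, 300):.0%}")   # → 33%
```

The controller can use the untouched space as extra spare area for garbage 
collection, which is why write latency spikes tend to shrink as the OP ratio 
grows.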

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Alexandre DERUMIER
Sent: Thursday, September 18, 2014 5:13 AM
To: Mark Nelson
Cc: Somnath Roy; 姚宁; [email protected]; Chen, Xiaoxi
Subject: Re: puzzled with the design pattern of ceph journal, really ruining 
performance

>>FWIW, the journal will coalesce writes quickly when there are many 
>>concurrent 4k client writes.  Once you hit around 8 4k IOs per OSD, 
>>the journal will start coalescing.  For say 100-150 IOPs (what a 
>>spinning disk can handle), expect around 9ish 100KB journal writes 
>>(with padding and header/footer for each client IO).  What we've seen 
>>is that some drives that aren't that great at 4K O_DSYNC writes are 
>>still reasonably good with 8+ concurrent larger O_DSYNC writes.

Yes, indeed, it's not that bad (hopefully ;)

When benchmarking the Crucial M550, I only see, from time to time (maybe every 
30s, I don't remember exactly), IOPS dropping to 200 for 1 or 2 seconds, then 
going back up to normal speed, around 4000 IOPS.

With the Intel S3500, I get constant writes at 5000 IOPS.


BTW, does the rbd client cache also help coalesce writes (on the client side), 
and thus also help the journal?





----- Original Message -----

From: "Mark Nelson" <[email protected]>
To: "Alexandre DERUMIER" <[email protected]>, "Xiaoxi Chen" 
<[email protected]>
Cc: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, 
[email protected]
Sent: Wednesday, September 17, 2014 17:01:06
Subject: Re: puzzled with the design pattern of ceph journal, really ruining 
performance

On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote: 
>>> 2. Have you got any data to prove the O_DSYNC or fdatasync kill the 
>>> performance of journal? In our previous test, the journal SSD (use a 
>>> partition of a SSD as a journal for a particular OSD, and 4 OSD 
>>> share the same SSD) could reach its peak performance (300-400MB/s)
> 
> Hi,
> 
> I have done some bench here: 
> 
> http://www.mail-archive.com/[email protected]/msg12950.html
> 
> Some SSD models have really bad performance with O_DSYNC (the Crucial M550 
> does 312 IOPS with 4k blocks). 
> Benching 1 OSD, I can see big latencies for some seconds when O_DSYNC writes 
> occur

FWIW, the journal will coalesce writes quickly when there are many concurrent 
4k client writes. Once you hit around 8 4k IOs per OSD, the journal will start 
coalescing. For say 100-150 IOPs (what a spinning disk can handle), expect 
around 9ish 100KB journal writes (with padding and header/footer for each 
client IO). What we've seen is that some drives that aren't that great at 4K 
O_DSYNC writes are still reasonably good with 8+ concurrent larger O_DSYNC 
writes. 
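As a rough back-of-the-envelope check of those figures (the per-IO overhead 
below is an assumed value chosen to land near the ~100KB write size mentioned 
above, not a Ceph constant):

```python
# Back-of-the-envelope sketch of journal coalescing. The per-IO overhead
# (header/footer plus padding) is an assumed figure picked to roughly match
# the ~100KB journal writes mentioned above, not a measured Ceph constant.

CLIENT_IO = 4096   # one 4k client write
OVERHEAD = 8192    # assumed padding + header/footer per client IO

def journal_write_size(concurrent_ios):
    """Approximate size of one coalesced journal write."""
    return concurrent_ios * (CLIENT_IO + OVERHEAD)

# Around 8 concurrent 4k IOs per OSD, the journal starts coalescing:
print(journal_write_size(8))   # → 98304, i.e. ~96KB, near the 100KB figure
```

At 100-150 client IOPS, batches of roughly 8-16 IOs would give on the order of 
ten journal writes per second, which is consistent with the "9ish 100KB 
journal writes" above.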

> 
> 
> 
> crucial m550
> ------------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
>   --group_reporting --invalidate=0 --name=ab --sync=1
> bw=1249.9KB/s, iops=312
> 
> intel s3500
> -----------
> # fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 
>   --group_reporting --invalidate=0 --name=ab --sync=1
> bw=41794KB/s, iops=10448
> 
> ----- Original Message -----
> 
> From: "Xiaoxi Chen" <[email protected]>
> To: "Somnath Roy" <[email protected]>, "姚宁" <[email protected]>, 
> [email protected]
> Sent: Wednesday, September 17, 2014 09:59:37
> Subject: RE: puzzled with the design pattern of ceph journal, really 
> ruining performance
> 
> Hi Nicheal,
> 
> 1. The main purpose of the journal is to provide transaction semantics 
> (preventing partial updates). Peering is not enough for this, because Ceph 
> writes all replicas at the same time, so after a crash you have no idea which 
> replica has the right data. For example, say we have 2 replicas: a user 
> updates a 4M object, the primary OSD crashes after the first 2M has been 
> written, and the secondary OSD may also fail after the first 3M has been 
> written. The versions on both the primary and the secondary are then neither 
> the new value nor the old value, and there is no way to recover. So, sharing 
> the same idea as a database, we need a journal to support transactions and 
> prevent this from happening. For backends that support transactions, BTRFS 
> for instance, we don't need a journal for correctness; we can write the 
> journal and the data disk at the same time. The journal there just tries to 
> help performance, since it only does sequential writes, which we expect to be 
> faster than the backend OSD. 
> 
> 2. Have you got any data proving that O_DSYNC or fdatasync kills journal 
> performance? In our previous tests, the journal SSD (using a partition of an 
> SSD as the journal for a particular OSD, with 4 OSDs sharing the same SSD) 
> could reach its peak performance (300-400MB/s)
> 
> Xiaoxi
> 
> 
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Somnath Roy
> Sent: Wednesday, September 17, 2014 3:30 PM
> To: 姚宁; [email protected]
> Subject: RE: puzzled with the design pattern of ceph journal, really 
> ruining performance
> 
> Hi Nicheal,
> Not only recovery; IMHO the main purpose of the Ceph journal is to support 
> transaction semantics, since XFS doesn't have them. I guess that can't be 
> achieved with pg_log/pg_info. 
> 
> Thanks & Regards
> Somnath
> 
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of ?? 
> Sent: Tuesday, September 16, 2014 11:29 PM
> To: [email protected]
> Subject: puzzled with the design pattern of ceph journal, really 
> ruining performance
> 
> Hi, guys
> 
> I have been analyzing the architecture of the Ceph source code. 
> 
> I know that, in order to keep the journal atomic and consistent, the journal 
> should be opened with O_DSYNC, or fdatasync() should be called after every 
> write operation. However, this kind of operation really kills performance and 
> causes high commit latency, even if an SSD is used as the journal disk. If 
> the SSD has capacitors to keep data safe when the system crashes, we can set 
> the nobarrier mount option, or the SSD itself will ignore the FLUSH request, 
> so performance would be better. 
> 
> So can it be replaced by other strategies? 
> As far as I am concerned, I think the most important parts are pg_log and 
> pg_info. They guide a crashed OSD to recover its objects from its peers. 
> Therefore, if we can keep pg_log at a consistent point, we can recover data 
> without the journal. So can we just use an "undo" strategy on pg_log and skip 
> the Ceph journal? That would save lots of bandwidth, and based on a 
> consistent pg_log epoch we could always recover data from the peering OSDs, 
> right? But this would lead to recovering more objects if the OSD crashes. 
> 
> Nicheal
> 
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
body of a message to [email protected] More majordomo info at  
http://vger.kernel.org/majordomo-info.html
