On 09/17/2014 09:20 AM, Alexandre DERUMIER wrote:
2. Have you got any data to prove the O_DSYNC or fdatasync kill the performance of 
journal? In our previous test, the journal SSD (use a partition of a SSD as a 
journal for a particular OSD, and 4 OSD share a same SSD) could reach its 
peak performance (300-400MB/s)

Hi,

I have run some benchmarks here:

http://www.mail-archive.com/[email protected]/msg12950.html

Some SSD models have really bad performance with O_DSYNC (Crucial M550: 312 
IOPS with 4k blocks).
When benchmarking 1 OSD, I can see big latencies for several seconds when O_DSYNC writes occur.

FWIW, the journal will coalesce writes quickly when there are many concurrent 4k client writes. Once you hit around eight 4k IOs per OSD, the journal will start coalescing. For, say, 100-150 IOPS (what a spinning disk can handle), expect around 9ish 100KB journal writes (with padding and header/footer for each client IO). What we've seen is that some drives that aren't that great at 4k O_DSYNC writes are still reasonably good with 8+ concurrent larger O_DSYNC writes.
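
To make the coalescing effect concrete, here is a toy model in Python. It is only a sketch and not Ceph's journal code: the function name `coalesce`, its parameters, and the single-writer assumption are all hypothetical. It assumes one journal writer that, whenever it is free, takes every queued client write and commits the whole batch with one synchronous write of fixed latency.

```python
def coalesce(arrival_times, sync_latency):
    """Return the batch sizes a single (hypothetical) journal writer produces.

    arrival_times: times at which client writes are queued (any unit).
    sync_latency:  time one synchronous journal write takes (same unit).
    """
    pending = sorted(arrival_times)
    batches = []
    now = 0
    i = 0
    while i < len(pending):
        # The writer idles until at least one entry is queued.
        now = max(now, pending[i])
        # Take every entry that has arrived by the time the writer is free.
        j = i
        while j < len(pending) and pending[j] <= now:
            j += 1
        batches.append(j - i)
        # One synchronous write covers the whole batch.
        now += sync_latency
        i = j
    return batches


if __name__ == "__main__":
    # One client write per ms; a slow device needing 3 ms per sync write.
    print(coalesce(range(16), 3))    # [1, 3, 3, 3, 3, 3]
    # A fast device (0.5 ms per sync write) keeps up, so nothing coalesces.
    print(coalesce(range(16), 0.5))  # sixteen batches of size 1
```

In this model the slower the synchronous write, the larger each batch becomes, which is consistent with the observation that a drive with poor small-write O_DSYNC numbers ends up doing fewer, larger journal writes under concurrency.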




crucial m550
------------
# fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
      --group_reporting --invalidate=0 --name=ab --sync=1
bw=1249.9KB/s, iops=312

intel s3500
-----------
# fio --filename=/dev/sdb --direct=1 --rw=write --bs=4k --numjobs=2 \
      --group_reporting --invalidate=0 --name=ab --sync=1
bw=41794KB/s, iops=10448
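
For a rough feel of what the synchronous flag costs without touching a raw device, the following Python sketch times sequential 4 KiB writes to an ordinary file, with and without O_DSYNC. It is not a substitute for the fio runs above: it does not use O_DIRECT, it goes through a filesystem, and the helper name `bench_writes` and its parameters are made up for illustration. It assumes a Unix-like system; the numbers depend entirely on the drive and filesystem (on tmpfs the flag has no real effect).

```python
import os
import tempfile
import time

BLOCK = 4096  # 4 KiB, matching --bs=4k


def bench_writes(path, count=64, dsync=False):
    """Write `count` 4 KiB blocks sequentially and return writes per second."""
    flags = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    if dsync:
        # With O_DSYNC each write returns only once the data is on stable
        # storage. Fall back to O_SYNC where O_DSYNC is not defined.
        flags |= getattr(os, "O_DSYNC", os.O_SYNC)
    buf = b"\0" * BLOCK
    fd = os.open(path, flags, 0o600)
    try:
        start = time.perf_counter()
        for _ in range(count):
            os.write(fd, buf)
        elapsed = time.perf_counter() - start
    finally:
        os.close(fd)
    return count / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "journal.bin")
        print("buffered:", round(bench_writes(path)), "writes/s")
        print("O_DSYNC :", round(bench_writes(path, dsync=True)), "writes/s")
```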

----- Original Message -----

From: "Xiaoxi Chen" <[email protected]>
To: "Somnath Roy" <[email protected]>, "??" <[email protected]>, 
[email protected]
Sent: Wednesday, 17 September 2014 09:59:37
Subject: RE: puzzled with the design pattern of ceph journal, really ruining 
performance

Hi Nicheal,

1. The main purpose of the journal is to provide transaction semantics (preventing 
partial updates). Relying on peers is not enough for this, because Ceph writes 
all replicas at the same time, so after a crash you have no idea which replica 
has the right data. For example, say we have 2 replicas and a user updates a 
4MB object. The primary OSD crashes when the first 2MB has been written, and 
the secondary OSD may also fail when the first 3MB has been written. Then the 
versions on the primary and the secondary are neither the new value nor the 
old value, and there is no way to recover. So, following the same idea as a 
database, we need a journal to support transactions and prevent this from 
happening. For backends that support transactions, BTRFS for instance, we 
don't strictly need a journal; we can write the journal and the data disk at 
the same time, and the journal there just helps performance, since it only 
does sequential writes and we expect it to be faster than the backend OSD disk.
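
The torn-update scenario and the write-ahead idea can be sketched in a few lines of Python. This is an illustration only, not how the Ceph FileStore journal is implemented: the function names, the JSON record format, and the `crash_after` hook are all hypothetical, and the object is assumed to keep the same length across updates. The point is the ordering: the whole transaction is made durable in the journal before the data file is touched, so after a crash the object can always be brought to the new value (or, if the journal entry itself is torn, left at the old one).

```python
import json
import os
import tempfile


def journaled_update(journal_path, data_path, new_value, crash_after=None):
    """Journal `new_value`, then overwrite the existing data file in place.

    `crash_after` is a hypothetical test hook: it cuts the data write short
    after that many bytes and skips the commit, imitating a crash mid-update.
    """
    payload = new_value.encode()
    with open(journal_path, "wb") as j:
        j.write(json.dumps({"value": new_value}).encode())
        j.flush()
        os.fsync(j.fileno())  # the full transaction is durable from here on
    with open(data_path, "r+b") as d:
        d.write(payload if crash_after is None else payload[:crash_after])
        d.flush()
        os.fsync(d.fileno())
    if crash_after is None:
        os.remove(journal_path)  # committed: the journal entry can be dropped


def replay(journal_path, data_path):
    """On restart, re-apply a pending journal entry. True if one was applied."""
    if not os.path.exists(journal_path):
        return False
    with open(journal_path, "rb") as j:
        raw = j.read()
    try:
        value = json.loads(raw.decode())["value"]
    except (ValueError, KeyError, TypeError):
        # Torn journal entry: the data file was never touched, old value stands.
        os.remove(journal_path)
        return False
    with open(data_path, "r+b") as d:
        d.write(value.encode())
        d.flush()
        os.fsync(d.fileno())
    os.remove(journal_path)
    return True


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        jp, dp = os.path.join(tmp, "journal"), os.path.join(tmp, "object")
        with open(dp, "wb") as f:
            f.write(b"AAAA")                          # old value
        journaled_update(jp, dp, "BBBB", crash_after=2)
        print(open(dp, "rb").read())                  # b'BBAA': neither old nor new
        replay(jp, dp)
        print(open(dp, "rb").read())                  # b'BBBB': new value restored
```

A real journal would also checksum and sequence its entries; that is left out here to keep the ordering argument visible.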

2. Have you got any data to prove that O_DSYNC or fdatasync kills the 
performance of the journal? In our previous test, the journal SSD (a partition 
of an SSD used as the journal for a particular OSD, with 4 OSDs sharing the 
same SSD) could reach its peak performance (300-400MB/s).

Xiaoxi



-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Somnath Roy
Sent: Wednesday, September 17, 2014 3:30 PM
To: 姚宁; [email protected]
Subject: RE: puzzled with the design pattern of ceph journal, really ruining 
performance

Hi Nicheal,
Not only recovery: IMHO the main purpose of the Ceph journal is to support 
transaction semantics, since XFS doesn't have that. I guess it can't be achieved 
with pg_log/pg_info.

Thanks & Regards
Somnath

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of ??
Sent: Tuesday, September 16, 2014 11:29 PM
To: [email protected]
Subject: puzzled with the design pattern of ceph journal, really ruining 
performance

Hi, guys

I have been analyzing the architecture of the Ceph source code.

I know that, in order to keep the journal atomic and consistent, the journal 
must be opened with O_DSYNC, or fdatasync() must be called after every 
write operation. However, this really kills performance and causes high 
commit latency, even if an SSD is used as the journal disk. If the SSD has a 
capacitor to keep the data safe when the system crashes, we can set the mount 
option nobarrier, or the SSD itself will ignore the FLUSH request, so the 
performance would be better.

So can it be replaced by other strategies?
As far as I can see, the most important parts are pg_log and pg_info. They 
guide a crashed OSD in recovering its objects from its peers. Therefore, if we 
can keep pg_log at a consistent point, we can recover data without the 
journal. So can we just use an "undo" strategy on pg_log and drop the Ceph 
journal? It would save a lot of bandwidth, and, based on the consistent pg_log 
epoch, we can always recover data from the peer OSDs, right? But this would 
lead to recovering more objects if the OSD crashes.
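
As a toy illustration of this proposal (not Ceph's actual pg_log structures or recovery code), suppose a log is just a list of (version, object name) entries and the crashed OSD knows the last version at which its own log was consistent. The hypothetical helper below lists what would then have to be fetched from a peer: every object the peer's log shows as modified after that version.

```python
def objects_to_recover(peer_log, last_consistent_version):
    """Return objects modified after `last_consistent_version`.

    peer_log: iterable of (version, object_name) pairs, a stand-in for a
    peer's log. Objects are returned once each, in first-modified order.
    """
    to_recover = []
    for version, obj in peer_log:
        if version > last_consistent_version and obj not in to_recover:
            to_recover.append(obj)
    return to_recover


if __name__ == "__main__":
    peer_log = [(1, "a"), (2, "b"), (3, "a"), (4, "c")]
    print(objects_to_recover(peer_log, 2))  # ['a', 'c']
```

Each listed object is re-fetched in full even if only part of it changed, which is the extra recovery cost mentioned above.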

Nicheal


