Hi,
Ceph needs to maintain a journal in the case of FileStore because the underlying 
filesystem, e.g. XFS, *doesn't have* any transactional semantics. Ceph has to 
write data and metadata transactionally in the write path. It does so in the 
following way.

1. It creates a transaction (Tx) object containing multiple metadata operations 
plus the actual payload write.

2. This object is passed to the ObjectStore layer.

3. The ObjectStore can complete the transaction synchronously or asynchronously 
(FileStore is asynchronous).

4. FileStore dumps the entire transaction object to the journal. The journal is 
a circular buffer, written to disk sequentially with O_DIRECT | O_DSYNC.

5. Once the journal write succeeds, the write is acknowledged to the client. 
Reads of this data are not allowed yet, since it has not been written to its 
actual location in the filesystem.

6. The actual execution of the transaction is done in parallel for filesystems 
that can do checkpointing, like Btrfs. For filesystems like XFS/ext4 the 
journal is write-ahead, i.e. the Tx object is written to the journal first and 
only then is the Tx executed.

7. Tx execution is done in parallel by the FileStore worker threads. The 
payload write is a buffered write, and a sync thread within FileStore 
periodically calls 'syncfs' to persist data/metadata to the actual location.

8. Before each 'syncfs' call it determines the seq number up to which 
everything is persisted, and trims the transaction objects from the journal up 
to that point. This makes room for more writes in the journal. If the journal 
is full, writes will stall.

9. If the OSD crashes after a write is acknowledged, the Tx is replayed from 
the last successful backend commit seq number (maintained in a file updated 
after 'syncfs').

So, as you can see, the journal is not a flaw but a necessity for FileStore 
with an RBD workload, since RBD can do partial overwrites. It is not needed for 
full writes, as with objects, and that's why Sage came up with a new store that 
will not do double writes for object workloads.
The KeyValueStore backend also doesn't have a journal, since it relies on a 
backend like LevelDB/RocksDB for that.

Regarding Jan's point about a block vs. file journal, IMO the only advantage of 
the journal being a block device is that FileStore can do AIO writes to it.

Now, here is what SanDisk changed:

1. In the write path FileStore has to do some throttling, as the journal can't 
get much further ahead than the actual backend write (Tx execution). We 
introduced dynamic throttling based on the journal fill rate and a % increase 
over the config option filestore_queue_max_bytes, which caps the outstanding 
backend byte writes.
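A minimal sketch of that kind of dynamic throttle, assuming names and the 
exact scaling rule of my own invention (the real logic lives in FileStore's 
C++ throttle code and its config machinery):

```python
class DynamicThrottle:
    """Illustrative only: delay new submits when journaled-but-unexecuted
    bytes exceed a limit that tightens as the journal fills."""

    def __init__(self, queue_max_bytes, bump_pct=10):
        # queue_max_bytes plays the role of filestore_queue_max_bytes.
        self.max_bytes = queue_max_bytes
        self.bump = queue_max_bytes * bump_pct // 100
        self.outstanding = 0   # bytes journaled but not yet executed

    def should_delay(self, journal_fill_ratio):
        # Grant extra headroom while the journal is mostly empty,
        # shrink it to the bare cap as the journal fills (ratio -> 1.0).
        limit = self.max_bytes + int(self.bump * (1.0 - journal_fill_ratio))
        return self.outstanding >= limit

    def start(self, nbytes):
        self.outstanding += nbytes

    def finish(self, nbytes):
        self.outstanding -= nbytes
```

The point is the coupling: the limit on outstanding backend writes is not a 
fixed constant but a function of how full the journal currently is.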

2. Instead of a buffered write we introduced an O_DSYNC write during 
transaction execution, since it reduces the amount of data syncfs has to write 
and thus gives more stable performance.
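The difference is just a flag at open time; a small sketch (hypothetical 
helper, not Ceph code) of the two modes:

```python
import os

def apply_tx_payload(path, data, offset, use_dsync=True):
    """Write a Tx payload either with O_DSYNC (data is on stable storage
    before the call returns) or buffered (left for a later syncfs)."""
    flags = os.O_WRONLY | os.O_CREAT
    if use_dsync:
        # Each write is synced as it happens, so the periodic syncfs
        # has far less dirty data to push out at once.
        flags |= os.O_DSYNC
    fd = os.open(path, flags, 0o644)
    try:
        os.pwrite(fd, data, offset)
    finally:
        os.close(fd)
```

With buffered writes the dirty data piles up until syncfs; with O_DSYNC the 
cost is paid per write, which is slower per operation but much smoother.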

3. The main reason we can't allow the journal to get too far ahead is that the 
Tx object is not deleted until the Tx executes; the further behind Tx execution 
falls, the more memory growth occurs. Previously the Tx object was deleted 
asynchronously (and thus took more time); we changed it to be deleted from the 
FileStore worker thread itself.

4. The sync thread is optimized to do a fast sync. The extra last-commit-seq 
file is no longer maintained for *the write-ahead journal*, as this information 
can be found in the journal header.
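For illustration only, keeping the commit seq in the journal header rather 
than a side file might look like this (the layout and names here are invented; 
the real on-disk header format is different):

```python
import struct

# Hypothetical header: (committed_seq, write_pos), little-endian u64 pair.
HEADER_FMT = "<QQ"
HEADER_LEN = struct.calcsize(HEADER_FMT)

def write_header(journal_path, committed_seq, write_pos):
    """Record the last backend-committed seq in the journal header."""
    with open(journal_path, "r+b") as f:
        f.write(struct.pack(HEADER_FMT, committed_seq, write_pos))

def read_committed_seq(journal_path):
    """At replay time, recover the commit point from the header instead
    of reading a separate last-commit-seq file."""
    with open(journal_path, "rb") as f:
        committed_seq, _ = struct.unpack(HEADER_FMT, f.read(HEADER_LEN))
    return committed_seq
```

One fewer file to fsync on the hot path is exactly where the "fast sync" win 
comes from.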

Here are the related pull requests:

https://github.com/ceph/ceph/pull/7271

https://github.com/ceph/ceph/pull/7303

https://github.com/ceph/ceph/pull/7278

https://github.com/ceph/ceph/pull/6743



Regarding bypassing the filesystem and accessing the block device directly: 
yes, that should be a cleaner, simpler and more efficient solution. With Sage's 
BlueStore, Ceph is moving towards that very fast!

Thanks & Regards
Somnath

From: ceph-users [mailto:[email protected]] On Behalf Of Tyler 
Bishop
Sent: Thursday, January 28, 2016 1:35 PM
To: Jan Schermer
Cc: [email protected]
Subject: Re: [ceph-users] SSD Journal

What approach did SanDisk take with this for Jewel?






Tyler Bishop
Chief Technical Officer
513-299-7108 x10


[email protected]<mailto:[email protected]>



If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited.




________________________________
From: "Jan Schermer" <[email protected]>
To: "Tyler Bishop" <[email protected]>
Cc: "Bill WONG" <[email protected]>, [email protected]
Sent: Thursday, January 28, 2016 4:32:54 PM
Subject: Re: [ceph-users] SSD Journal

You can't run a Ceph OSD without a journal. The journal is always there.
If you don't have a journal partition, then there's a "journal" file on the 
OSD filesystem that does the same thing. If it's a partition, then this file 
turns into a symlink.

You will always be better off with a journal on a separate partition because of 
the way the writeback cache in Linux works (someone correct me if I'm wrong).
The journal needs to flush to disk quite often, and Linux is not always able to 
flush only the journal data. You can't defer metadata flushing forever, and 
doing fsync() makes all the dirty data flush as well. ext2/3/4 also flushes 
data to the filesystem periodically (every 5s, I think?), which will 
momentarily make the latency of the journal go through the roof.
(I'll leave researching how exactly XFS does it to those who care about that 
"filesystem'o'thing".)

P.S. I feel very strongly that this whole concept is fundamentally broken. We 
already have a journal for the filesystem which is time-proven, well behaved 
and above all fast. Instead there's this reinvented wheel which supposedly does 
it better in userspace, while not really avoiding the filesystem journal 
either. It would maybe make sense if the OSD stored the data on a block device 
directly, avoiding the filesystem altogether. But it would still do the same 
bloody thing, and (no disrespect) ext4 does this better than Ceph ever will.


On 28 Jan 2016, at 20:01, Tyler Bishop <[email protected]> wrote:

This is an interesting topic that I've been waiting for.

Right now we run the journal as a partition on the data disk. I've built 
drives without journals, and the write performance seems okay, but random IO 
performance is poor compared to what it should be.





________________________________
From: "Bill WONG" <[email protected]>
To: "ceph-users" <[email protected]>
Sent: Thursday, January 28, 2016 1:36:01 PM
Subject: [ceph-users] SSD Journal

Hi,
I have tested an SSD journal with SATA drives and it works perfectly. Now I am 
testing a full-SSD Ceph cluster; do I still need a separate SSD as the journal 
disk?

[Assume I do not have PCIe SSD flash, which performs better than a normal SSD.]

Please give some ideas on a full-SSD Ceph cluster... thank you!

_______________________________________________
ceph-users mailing list
[email protected]<mailto:[email protected]>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


