The extra network move that I was referring to would be local, i.e., from the 
node containing the write-ahead journal to the nodes containing the destination 
objects. I wasn't counting any geo-replication; that would be yet another 
network move.


-----------------------------------------------------------
Now I know what a statesman is; he's a dead politician. We need more statesmen. 
 Bob Edwards 

Allen Samuels
Chief Software Architect, Emerging Storage Solutions 

951 SanDisk Drive, Milpitas, CA 95035
T: +1 408 801 7030 | M: +1 408 780 6416
[email protected]


-----Original Message-----
From: Sage Weil [mailto:[email protected]]
Sent: Wednesday, May 07, 2014 12:33 PM
To: Allen Samuels
Cc: [email protected]
Subject: RE: RBD thoughts

On Wed, 7 May 2014, Allen Samuels wrote:
> Ok, now I think I understand. Essentially, you have a write-ahead log
> + lazy application of the log to the backend + code that correctly
> deals with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.).
> Correct?

Right.

> So every block write is done three times, once for the replication 
> journal, once in the FileStore journal and once in the target file 
> system. Correct?

More than that, actually.  With the FileStore backend, every write is done 2x.  
The rbd journal would be on top of rados objects, so that's 2*2.  
But that cost goes away with an improved backend that doesn't need a journal 
(like the kv backend or f2fs).
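
To put rough numbers on it, a back-of-the-envelope sketch (assuming a 
plain 3x-replicated pool, counting only on-disk writes and ignoring 
the network hops):

    # Illustrative arithmetic only -- not ceph code.  Each rados write
    # lands on every replica, and FileStore writes each replica copy
    # twice (journal + filesystem); the rbd journal doubles the rados
    # writes (journal objects + image objects).
    def disk_writes(replicas=3, filestore_factor=2, rbd_journal=True):
        rados_writes = 2 if rbd_journal else 1
        return rados_writes * replicas * filestore_factor

    print(disk_writes())  # -> 12 on-disk writes per client write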

> Also, if I understand the architecture, you'll be moving the data over 
> the network at least one more time (* # of replicas). Correct?

Right; this would be mirrored in the target cluster, probably in another data 
center.

> This seems VERY expensive in system resources, though I agree it's a 
> simpler implementation task.

It's certainly not free. :) 

sage


> 
> -----------------------------------------------------------
> Never put off until tomorrow what you can do the day after tomorrow.
>  Mark Twain
> 
> Allen Samuels
> Chief Software Architect, Emerging Storage Solutions
> 
> 951 SanDisk Drive, Milpitas, CA 95035
> T: +1 408 801 7030 | M: +1 408 780 6416
> [email protected]
> 
> 
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: Wednesday, May 07, 2014 9:24 AM
> To: Allen Samuels
> Cc: [email protected]
> Subject: RE: RBD thoughts
> 
> On Wed, 7 May 2014, Allen Samuels wrote:
> > Sage wrote:
> > > Allen wrote:
> > > > I was looking over the CDS for Giant and was paying particular 
> > > > attention to the rbd journaling stuff. Asynchronous 
> > > > geo-replications for block devices is really a key for 
> > > > enterprise deployment and this is the foundational element of 
> > > > that. It's an area that we are keenly interested in and would be 
> > > > willing to devote development resources toward. It wasn't clear 
> > > > from the recording whether this was just musings or would 
> > > > actually be development for Giant, but when you get your head 
> > > > above water w.r.t. the acquisition I'd like to investigate how we 
> > > > (Sandisk) could help turn this into a real project. IMO, this is MUCH 
> > > > more important than CephFS stuff for penetrating enterprises.
> > > >
> > > > The blueprint suggests the creation of an additional journal for 
> > > > the block device and that this journal would track metadata 
> > > > changes and potentially record overwritten data (without the 
> > > > overwritten data you can only sync to snapshots -- which will be 
> > > > reasonable functionality for some use-cases). It seems to me 
> > > > that this probably doesn't work too well. Wouldn't it be the 
> > > > case that you really want to commit to the journal AND to the 
> > > > block device atomically? That's really problematic with the 
> > > > current RADOS design as the separate journal would be in a 
> > > > separate PG from the target block and likely on a separate OSD. Now you 
> > > > have all sorts of cases of crashes/updates where the journal and the 
> > > > target block are out of sync.
> > >
> > > The idea is to make it a write-ahead journal, which avoids any 
> > > need for atomicity.  The writes are streamed to the journal, and 
> > > applied to the rbd image proper only after they commit there.
> > > Since block operations are effectively idempotent (you can replay 
> > > the journal from any point and the end result is always the same) 
> > > the recovery case is pretty simple.
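> > >
> > > A toy illustration of that (bytearrays standing in for the image;
> > > not real code):
> > >
> > >     def replay(image, journal, start=0):
> > >         # Rewriting bytes with the values they already have is
> > >         # harmless, so any starting point at or before the true
> > >         # resume point yields the same final image.
> > >         for off, data in journal[start:]:
> > >             image[off:off + len(data)] = data
> > >         return image
> > >
> > >     j = [(0, b"ab"), (1, b"cd")]
> > >     once = replay(bytearray(8), j)
> > >     again = replay(bytearray(once), j)   # replay a second time
> > >     assert once == again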
> >
> > Who is responsible for the block device part of the commit? If it's 
> > the RBD code rather than the OSD, then I think there's a dangerous 
> > failure case where the journal commits and then the client crashes 
> > and the journal-based replication system ends up replicating the 
> > last
> > (un-performed) write operation. If it's the OSDs that are 
> > responsible, then this is not an issue.
> 
> The idea is to use the usual set of write-ahead journaling tricks: we write 
> first to the journal, then to the device, and lazily update a pointer 
> indicating which journal events have been applied.  After a crash, the new 
> client will reapply anything in the journal after that point to ensure the 
> device is in sync.
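> 
> In toy form (lists and bytearrays standing in for the journal objects 
> and the image; the names are made up):
> 
>     class JournaledImage:
>         def __init__(self, size):
>             self.journal = []              # append-only event stream
>             self.image = bytearray(size)   # the device proper
>             self.applied = 0               # lazily persisted pointer
> 
>         def write(self, off, data):
>             self.journal.append((off, data))        # 1. journal first
>             self.image[off:off + len(data)] = data  # 2. then the device
>             self.applied = len(self.journal)        # 3. advance lazily
> 
>         def recover(self):
>             # After a crash the persisted pointer may lag reality;
>             # reapply everything past it.  Idempotency makes any
>             # double-application harmless.
>             for off, data in self.journal[self.applied:]:
>                 self.image[off:off + len(data)] = data
>             self.applied = len(self.journal)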
> 
> While the device is in active use, we'd need to track which writes 
> have not yet been applied to the device so we can delay a read 
> following a recent write until it is applied.  (This should be very 
> rare, given that the file system sitting on top of the device is 
> generally doing all sorts of caching.)
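> 
> The bookkeeping might look like this (hypothetical sketch; a real 
> implementation would block on a condition variable, not poll):
> 
>     import time
> 
>     inflight = set()   # {(offset, length)}: journaled, not yet applied
> 
>     def overlaps(off, length):
>         return any(o < off + length and off < o + l
>                    for o, l in inflight)
> 
>     def read(image, off, length):
>         while overlaps(off, length):   # rare, as noted above
>             time.sleep(0.001)          # wait for the apply to land
>         return bytes(image[off:off + length])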
> 
> This only works, of course, for use-cases where there is a single 
> active writer for the device.  That means it's usable for local file 
> systems like
> ext3/4 and xfs, but not for something like ocfs2.
> 
> > > Similarly, I don't think the snapshot limitation is there; you can 
> > > simply note the journal offset, then copy the image (in a racy 
> > > way), and then replay the journal from that position to capture 
> > > the recent updates.
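> > >
> > > As a toy model (illustrative only):
> > >
> > >     def consistent_copy(journal, image):
> > >         start = len(journal)       # 1. note the journal offset
> > >         copy = bytearray(image)    # 2. copy the image, racily
> > >         # 3. Replay everything since 'start'.  A real racy copy may
> > >         # already contain some, all, or none of these updates;
> > >         # idempotent replay makes the result coherent either way.
> > >         for off, data in journal[start:]:
> > >             copy[off:off + len(data)] = data
> > >         return copy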
> >
> > w.r.t. snapshots and non-old-data-preserving journaling mode, How 
> > will you deal with the race between reading the head of the journal 
> > and reading the data referenced by that head of the journal that 
> > could be over-written by a write operation before you can actually read it?
> 
> Oh, I think I'm using different terminology.  I'm assuming that the journal 
> includes the *new* data (a la data=journal mode for ext*).  We talked a bit at 
> CDS about an optional separate journal with overwritten data so that you 
> could 'rewind' activity on an image, but that is probably not what you were 
> talking about :).
> 
> > > > Even past the functional level issues this probably creates a 
> > > > performance hot-spot too ? also undesirable.
> > >
> > > For a naive journal implementation and busy block device, yes.  
> > > What I'd like to do, though, is make a journal abstraction on top 
> > > of librados that can eventually also replace the current MDS 
> > > journaler and do things a bit more intelligently.  The main thing 
> > > would be to stripe events over a set of objects to distribute the 
> > > load.  For the MDS, there are a bunch of other minor things we 
> > > want to do to streamline the implementation and to improve the ability to 
> > > inspect and repair the journal.
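> > >
> > > For instance (a hypothetical layout, not the actual journaler):
> > >
> > >     # Spread consecutive events across a set of journal objects so
> > >     # appends hit different PGs/OSDs instead of one hot object.
> > >     def journal_object_for(event_seq, stripe_count=8):
> > >         return "journal.%d" % (event_seq % stripe_count)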
> > >
> > > Note that the 'old data' would be an optional thing that would 
> > > only be enabled if the user wanted the ability to rewind.
> > >
> > > > It seems to me that the extra journal isn't necessary, i.e., 
> > > > that the current PG log already has most of the information 
> > > > that's needed (it doesn't have the 'old data', but that's easily added -- 
> > > > in fact it's cheaper to add it in with a special transaction 
> > > > token because you don't have to send the 'old data' over the wire twice -- 
> > > > the OSD can read it locally to put into the PG log). Of course, 
> > > > PG logs aren't synchronized across the pool but that's easy 
> > > > [...]
> > >
> > > I don't think the pg log can be sanely repurposed for this.  It is 
> > > a metadata journal only, and needs to be in order to make peering 
> > > work effectively, whereas the rbd journal needs to be a data 
> > > journal to work well.  Also, if the updates are spread across all 
> > > of the rbd image blocks/objects, then it becomes impractical to 
> > > stream them to another cluster because you'll need to watch for 
> > > those updates on all objects (vs just the journal objects)...
> >
> > I don't see the difference between the pg-log "metadata" journal and 
> > the rbd journal (when running in the 'non-old-data-preserving' mode).
> > Essentially, the pg-log allows a local replica to "catch up"; how is 
> > that different than allowing a non-local rbd to "catch up"??
> 
> The PG log only indicates which objects were touched and which versions are 
> (now) the latest.  When recovery happens, we go get the latest version of the 
> object from the usual location.  If there are two updates to the same object 
> the log tells us that happened, but we don't preserve the intermediate 
> version.  The rbd data journal, on the other hand, would preserve the full 
> update timeline, ensuring that we have a fully-coherent view of the image at 
> any point in the timeline.
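> 
> The contrast, as toy records (field names made up for illustration):
> 
>     pg_log_entry = {"oid": "rbd_data.123.0004", "version": 4711}
>     # which object changed and its new version -- metadata only
> 
>     rbd_journal_entry = {"offset": 4096, "length": 512,
>                          "data": b"<the bytes written>"}
>     # the update itself; two writes to one object stay two events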
> 
> --
> 
> In any case, this is the proposal we originally discussed at CDS.  I'm not 
> sure if it's the best or most efficient, but I think it is relatively simple 
> to implement and takes advantage of the existing abstractions and interfaces. 
>  Input is definitely welcome!  I'm skeptical that the pg log will be useful 
> in this case, but you're right that the overhead with the proposed approach 
> is non-trivial...
> 
> sage
> 
> 
> 
> 
> 

