On Wed, 7 May 2014, Allen Samuels wrote:
> Ok, now I think I understand. Essentially, you have a write-ahead log +
> lazy application of the log to the backend + code that correctly deals
> with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.).
> Correct?
Right.

> So every block write is done three times, once for the replication
> journal, once in the FileStore journal, and once in the target file
> system. Correct?

More than that, actually. With the FileStore backend, every write is done
2x. The rbd journal would be on top of rados objects, so that's 2*2. But
that cost goes away with an improved backend that doesn't need a journal
(like the kv backend or f2fs).

> Also, if I understand the architecture, you'll be moving the data over
> the network at least one more time (* # of replicas). Correct?

Right; this would be mirrored in the target cluster, probably in another
data center.

> This seems VERY expensive in system resources, though I agree it's a
> simpler implementation task.

It's certainly not free. :)

sage

> -----------------------------------------------------------
> Never put off until tomorrow what you can do the day after tomorrow.
>     Mark Twain
>
> Allen Samuels
> Chief Software Architect, Emerging Storage Solutions
>
> 951 SanDisk Drive, Milpitas, CA 95035
> T: +1 408 801 7030 | M: +1 408 780 6416
> [email protected]
>
>
> -----Original Message-----
> From: Sage Weil [mailto:[email protected]]
> Sent: Wednesday, May 07, 2014 9:24 AM
> To: Allen Samuels
> Cc: [email protected]
> Subject: RE: RBD thoughts
>
> On Wed, 7 May 2014, Allen Samuels wrote:
> > Sage wrote:
> > > Allen wrote:
> > > > I was looking over the CDS for Giant and was paying particular
> > > > attention to the rbd journaling stuff. Asynchronous
> > > > geo-replication for block devices is really a key for enterprise
> > > > deployment and this is the foundational element of that. It's an
> > > > area that we are keenly interested in and would be willing to
> > > > devote development resources toward. It wasn't clear from the
> > > > recording whether this was just musings or would actually be
> > > > development for Giant, but when you get your head above water
> > > > w.r.t.
> > > > the acquisition I'd like to investigate how we (SanDisk) could
> > > > help turn this into a real project. IMO, this is MUCH more
> > > > important than the CephFS stuff for penetrating enterprises.
> > > >
> > > > The blueprint suggests the creation of an additional journal for
> > > > the block device and that this journal would track metadata
> > > > changes and potentially record overwritten data (without the
> > > > overwritten data you can only sync to snapshots -- which will be
> > > > reasonable functionality for some use-cases). It seems to me that
> > > > this probably doesn't work too well. Wouldn't it be the case that
> > > > you really want to commit to the journal AND to the block device
> > > > atomically? That's really problematic with the current RADOS
> > > > design, as the separate journal would be in a separate PG from
> > > > the target block and likely on a separate OSD. Now you have all
> > > > sorts of cases of crashes/updates where the journal and the
> > > > target block are out of sync.
> > >
> > > The idea is to make it a write-ahead journal, which avoids any need
> > > for atomicity. The writes are streamed to the journal, and applied
> > > to the rbd image proper only after they commit there. Since block
> > > operations are effectively idempotent (you can replay the journal
> > > from any point and the end result is always the same) the recovery
> > > case is pretty simple.
> >
> > Who is responsible for the block device part of the commit? If it's
> > the RBD code rather than the OSD, then I think there's a dangerous
> > failure case where the journal commits and then the client crashes
> > and the journal-based replication system ends up replicating the last
> > (un-performed) write operation. If it's the OSDs that are
> > responsible, then this is not an issue.
> >
> The idea is to use the usual set of write-ahead journaling tricks: we
> write first to the journal, then to the device, and lazily update a
> pointer indicating which journal events have been applied. After a
> crash, the new client will reapply anything in the journal after that
> point to ensure the device is in sync.
>
> While the device is in active use, we'd need to track which writes have
> not yet been applied to the device so we can delay a read following a
> recent write until it is applied. (This should be very rare, given that
> the file system sitting on top of the device is generally doing all
> sorts of caching.)
>
> This only works, of course, for use-cases where there is a single
> active writer for the device. That means it's usable for local file
> systems like ext3/4 and xfs, but not for something like ocfs2.
>
> > > Similarly, I don't think the snapshot limitation is there; you can
> > > simply note the journal offset, then copy the image (in a racy
> > > way), and then replay the journal from that position to capture the
> > > recent updates.
> >
> > w.r.t. snapshots and non-old-data-preserving journaling mode, how
> > will you deal with the race between reading the head of the journal
> > and reading the data referenced by that head of the journal, which
> > could be over-written by a write operation before you can actually
> > read it?
>
> Oh, I think I'm using different terminology. I'm assuming that the
> journal includes the *new* data (a la data=journal mode for ext*). We
> talked a bit at CDS about an optional separate journal with overwritten
> data so that you could 'rewind' activity on an image, but that is
> probably not what you were talking about :).
>
> > > > Even past the functional level issues this probably creates a
> > > > performance hot-spot too -- also undesirable.
> > >
> > > For a naive journal implementation and busy block device, yes.
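The write-ahead tricks described above (commit to the journal first, apply
lazily, keep an applied pointer, delay reads of not-yet-applied extents, and
rely on idempotent replay for crash recovery) can be sketched in a few lines.
This is a toy single-writer model; all names here are hypothetical, not the
actual librbd journaling code:

```python
class WriteAheadImage:
    """Toy single-writer, journal-first block image.

    Writes commit to the journal before touching the image; a lazy step
    applies them and advances `applied_ptr`.  A read of an extent with an
    unapplied write forces the apply first, so reads stay coherent.
    Replay after a crash is safe because block writes are idempotent.
    """

    def __init__(self, size):
        self.image = bytearray(size)
        self.journal = []        # list of (offset, data) events
        self.applied_ptr = 0     # journal index; earlier events are applied

    def write(self, offset, data):
        self.journal.append((offset, bytes(data)))   # journal first

    def apply_some(self, n=1):
        """Lazily apply up to n committed events to the image."""
        for off, data in self.journal[self.applied_ptr:self.applied_ptr + n]:
            self.image[off:off + len(data)] = data
        self.applied_ptr = min(self.applied_ptr + n, len(self.journal))

    def read(self, offset, length):
        # Delay (here: force) application of any unapplied overlapping write.
        for i in range(self.applied_ptr, len(self.journal)):
            off, data = self.journal[i]
            if off < offset + length and offset < off + len(data):
                self.apply_some(i - self.applied_ptr + 1)
        return bytes(self.image[offset:offset + length])

    def replay(self):
        """Crash recovery: reapply everything past applied_ptr (idempotent)."""
        self.apply_some(len(self.journal) - self.applied_ptr)
```

Replaying from any earlier `applied_ptr` produces the same final image, which
is why the recovery path after a client crash stays simple.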
> > > What I'd like to do, though, is make a journal abstraction on top
> > > of librados that can eventually also replace the current MDS
> > > journaler and do things a bit more intelligently. The main thing
> > > would be to stripe events over a set of objects to distribute the
> > > load. For the MDS, there are a bunch of other minor things we want
> > > to do to streamline the implementation and to improve the ability
> > > to inspect and repair the journal.
> > >
> > > Note that the 'old data' would be an optional thing that would only
> > > be enabled if the user wanted the ability to rewind.
> > >
> > > > It seems to me that the extra journal isn't necessary, i.e., that
> > > > the current PG log already has most of the information that's
> > > > needed (it doesn't have the 'old data', but that's easily added --
> > > > in fact it's cheaper to add it in with a special transaction
> > > > token because you don't have to send the 'old data' over the wire
> > > > twice; the OSD can read it locally to put into the PG log). Of
> > > > course, PG logs aren't synchronized across the pool, but that's
> > > > easy [...]
> > >
> > > I don't think the pg log can be sanely repurposed for this. It is a
> > > metadata journal only, and needs to be in order to make peering
> > > work effectively, whereas the rbd journal needs to be a data
> > > journal to work well. Also, if the updates are spread across all of
> > > the rbd image blocks/objects, then it becomes impractical to stream
> > > them to another cluster because you'll need to watch for those
> > > updates on all objects (vs just the journal objects)...
> >
> > I don't see the difference between the pg-log "metadata" journal and
> > the rbd journal (when running in the 'non-old-data-preserving' mode).
> > Essentially, the pg-log allows a local replica to "catch up"; how is
> > that different from allowing a non-local rbd to "catch up"?
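Striping journal events over a set of objects, as proposed for the librados
journal abstraction above, can be as simple as splaying consecutive events
across a small fixed set of journal objects. The object naming scheme and
splay width below are invented for illustration, not the real implementation:

```python
def journal_object_for(image_id, event_seq, splay_width=8):
    """Map journal event `event_seq` to one of `splay_width` journal objects.

    Spreading consecutive events round-robin turns a single hot journal
    object into `splay_width` cooler ones, while a remote mirroring daemon
    still only has to watch those few journal objects rather than every
    object in the rbd image.
    """
    return "journal.%s.%d" % (image_id, event_seq % splay_width)
```

A consumer replaying the journal walks the objects in the same round-robin
order, so total ordering of events is preserved despite the striping.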
> >
> The PG log only indicates which objects were touched and which versions
> are (now) the latest. When recovery happens, we go get the latest
> version of the object from the usual location. If there are two updates
> to the same object, the log tells us that happened, but we don't
> preserve the intermediate version. The rbd data journal, on the other
> hand, would preserve the full update timeline, ensuring that we have a
> fully-coherent view of the image at any point in the timeline.
>
> --
>
> In any case, this is the proposal we originally discussed at CDS. I'm
> not sure if it's the best or most efficient, but I think it is
> relatively simple to implement and takes advantage of the existing
> abstractions and interfaces. Input is definitely welcome! I'm
> skeptical that the pg log will be useful in this case, but you're right
> that the overhead with the proposed approach is non-trivial...
>
> sage
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
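The distinction Sage draws between the PG log and an rbd data journal can be
made concrete: a metadata log keeps only the latest version per object, so
intermediate overwrites collapse, while a data journal keeps every event. A
toy illustration with hypothetical structures (not the actual PG log format):

```python
# Metadata log in the style of the PG log: latest version per object only.
pg_log = {}          # object name -> latest version
# Data journal in the style of the proposed rbd journal: full timeline.
data_journal = []    # list of (object, version, data) events

def record(obj, version, data):
    pg_log[obj] = version                      # overwrite: old entry is lost
    data_journal.append((obj, version, data))  # append: timeline preserved

record("block.0", 1, b"aaaa")
record("block.0", 2, b"bbbb")    # second update to the same object
```

After these two updates, the metadata log can only say "fetch the latest
copy of block.0", which is all local peering/recovery needs; the data
journal still holds the intermediate state, which is what lets a remote
cluster replay to any point in the timeline.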
