On Wed, May 7, 2014 at 3:32 PM, Sage Weil <[email protected]> wrote:
> On Wed, 7 May 2014, Allen Samuels wrote:
>> Ok, now I think I understand. Essentially, you have a write-ahead log +
>> lazy application of the log to the backend + code that correctly deals
>> with the RAW hazard (same as Cassandra, FileStore, LevelDB, etc.).
>> Correct?
>
> Right.
>
>> So every block write is done three times, once for the replication
>> journal, once in the FileStore journal and once in the target file
>> system. Correct?
>
> More than that, actually. With the FileStore backend, every write is
> done 2x. The rbd journal would be on top of rados objects, so that's 2*2.
> But that cost goes away with an improved backend that doesn't need a
> journal (like the kv backend or f2fs).
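[Editor's note: the write-amplification arithmetic above can be sketched as a small back-of-the-envelope helper. The function and its factor names are illustrative, not anything in Ceph itself; it just multiplies out the per-OSD journaling factor, the optional rbd-journal doubling, and the replica count.]

```python
def writes_per_client_write(journal_factor: int, replicas: int,
                            rbd_journaled: bool) -> int:
    """Rough count of physical writes per client write.

    journal_factor: per-OSD amplification (2 for FileStore, which
                    journals every write; 1 for a backend with no
                    journal of its own, e.g. a kv backend or f2fs).
    rbd_journaled:  True if the rbd journal (itself stored as rados
                    objects) is enabled, doubling rados-level writes.
    """
    rados_writes = 2 if rbd_journaled else 1  # journal object + image object
    return rados_writes * journal_factor * replicas

# Sage's "2*2" figure, per replica: FileStore backend + rbd journal
assert writes_per_client_write(2, 1, True) == 4
# A journal-free backend brings it back down to 2x per replica
assert writes_per_client_write(1, 1, True) == 2
```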
Side question. It's my understanding (via docs) that this also isn't the
case on btrfs, since there it does a clone from the journal (e.g.,
referencing the same blocks on disk). Is that correct?

>
>> Also, if I understand the architecture, you'll be moving the data over
>> the network at least one more time (* # of replicas). Correct?
>
> Right; this would be mirrored in the target cluster, probably in another
> data center.
>
>> This seems VERY expensive in system resources, though I agree it's a
>> simpler implementation task.
>
> It's certainly not free. :)
>
> sage
>
>
>>
>> -----------------------------------------------------------
>> Never put off until tomorrow what you can do the day after tomorrow.
>> Mark Twain
>>
>> Allen Samuels
>> Chief Software Architect, Emerging Storage Solutions
>>
>> 951 SanDisk Drive, Milpitas, CA 95035
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> [email protected]
>>
>>
>> -----Original Message-----
>> From: Sage Weil [mailto:[email protected]]
>> Sent: Wednesday, May 07, 2014 9:24 AM
>> To: Allen Samuels
>> Cc: [email protected]
>> Subject: RE: RBD thoughts
>>
>> On Wed, 7 May 2014, Allen Samuels wrote:
>> > Sage wrote:
>> > > Allen wrote:
>> > > > I was looking over the CDS for Giant and was paying particular
>> > > > attention to the rbd journaling stuff. Asynchronous
>> > > > geo-replication for block devices is really a key for enterprise
>> > > > deployment, and this is the foundational element of that. It's an
>> > > > area that we are keenly interested in and would be willing to
>> > > > devote development resources toward. It wasn't clear from the
>> > > > recording whether this was just musings or would actually be
>> > > > development for Giant, but when you get your head above water
>> > > > w.r.t. the acquisition I'd like to investigate how we (SanDisk) could
>> > > > help turn this into a real project. IMO, this is MUCH more important
>> > > > than the CephFS stuff for penetrating enterprises.
>> > > >
>> > > > The blueprint suggests the creation of an additional journal for
>> > > > the block device, and that this journal would track metadata
>> > > > changes and potentially record overwritten data (without the
>> > > > overwritten data you can only sync to snapshots, which will be
>> > > > reasonable functionality for some use-cases). It seems to me that
>> > > > this probably doesn't work too well. Wouldn't it be the case that
>> > > > you really want to commit to the journal AND to the block device
>> > > > atomically? That's really problematic with the current RADOS
>> > > > design, as the separate journal would be in a separate PG from the
>> > > > target block and likely on a separate OSD. Now you have all sorts of
>> > > > cases of crashes/updates where the journal and the target block are
>> > > > out of sync.
>> > >
>> > > The idea is to make it a write-ahead journal, which avoids any need
>> > > for atomicity. The writes are streamed to the journal, and applied
>> > > to the rbd image proper only after they commit there. Since block
>> > > operations are effectively idempotent (you can replay the journal
>> > > from any point and the end result is always the same) the recovery
>> > > case is pretty simple.
>> >
>> > Who is responsible for the block device part of the commit? If it's
>> > the RBD code rather than the OSD, then I think there's a dangerous
>> > failure case where the journal commits and then the client crashes, and
>> > the journal-based replication system ends up replicating the last
>> > (un-performed) write operation. If it's the OSDs that are responsible,
>> > then this is not an issue.
>>
>> The idea is to use the usual set of write-ahead journaling tricks: we write
>> first to the journal, then to the device, and lazily update a pointer
>> indicating which journal events have been applied. After a crash, the new
>> client will reapply anything in the journal after that point to ensure the
>> device is in sync.
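[Editor's note: the write-ahead scheme described above (journal first, apply lazily, replay from the applied pointer after a crash) can be sketched as a toy model. `JournaledDevice`, its in-memory journal list, and the dict standing in for the block device are all illustrative, not Ceph APIs; the point is that replay from any pointer at or before the true applied position is safe because block writes are idempotent.]

```python
class JournaledDevice:
    """Toy write-ahead journal over a block device (a plain dict here)."""

    def __init__(self):
        self.journal = []   # committed (offset, data) events, in order
        self.applied = 0    # journal index: everything before it is on the device
        self.device = {}    # offset -> data, standing in for the block device

    def write(self, offset, data):
        # 1. Commit to the journal first; the device is updated later.
        self.journal.append((offset, data))

    def apply_some(self, n=1):
        # 2. Lazily apply committed events and advance the applied pointer.
        for offset, data in self.journal[self.applied:self.applied + n]:
            self.device[offset] = data
        self.applied = min(self.applied + n, len(self.journal))

    def recover(self):
        # 3. After a crash, reapply everything at/after the saved pointer.
        #    Re-applying already-applied events is harmless (idempotent).
        for offset, data in self.journal[self.applied:]:
            self.device[offset] = data
        self.applied = len(self.journal)

dev = JournaledDevice()
dev.write(0, b'a'); dev.write(4, b'b'); dev.write(0, b'c')
dev.apply_some(1)   # simulate a crash before the rest is applied
dev.recover()       # replay brings the device in sync
assert dev.device == {0: b'c', 4: b'b'}
```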
>>
>> While the device is in active use, we'd need to track which writes have not
>> yet been applied to the device so we can delay a read following a recent
>> write until it is applied. (This should be very rare, given that the file
>> system sitting on top of the device is generally doing all sorts of caching.)
>>
>> This only works, of course, for use-cases where there is a single active
>> writer for the device. That means it's usable for local file systems like
>> ext3/4 and xfs, but not for something like ocfs2.
>>
>> > > Similarly, I don't think the snapshot limitation is there; you can
>> > > simply note the journal offset, then copy the image (in a racy way),
>> > > and then replay the journal from that position to capture the recent
>> > > updates.
>> >
>> > w.r.t. snapshots and the non-old-data-preserving journaling mode, how will
>> > you deal with the race between reading the head of the journal and
>> > reading the data referenced by that head of the journal, which could be
>> > overwritten by a write operation before you can actually read it?
>>
>> Oh, I think I'm using different terminology. I'm assuming that the journal
>> includes the *new* data (a la data=journal mode for ext*). We talked a bit
>> at CDS about an optional separate journal with overwritten data so that you
>> could 'rewind' activity on an image, but that is probably not what you were
>> talking about :).
>>
>> > > > Even past the functional-level issues, this probably creates a
>> > > > performance hot-spot too, which is also undesirable.
>> > >
>> > > For a naive journal implementation and a busy block device, yes. What
>> > > I'd like to do, though, is make a journal abstraction on top of
>> > > librados that can eventually also replace the current MDS journaler
>> > > and do things a bit more intelligently. The main thing would be to
>> > > stripe events over a set of objects to distribute the load.
>> > > For the MDS, there are a bunch of other minor things we want to do to
>> > > streamline the implementation and to improve the ability to inspect and
>> > > repair the journal.
>> > >
>> > > Note that the 'old data' would be an optional thing that would only
>> > > be enabled if the user wanted the ability to rewind.
>> > >
>> > > > It seems to me that the extra journal isn't necessary, i.e., that
>> > > > the current PG log already has most of the information that's
>> > > > needed (it doesn't have the 'old data', but that's easily added; in
>> > > > fact it's cheaper to add it with a special transaction token,
>> > > > because you don't have to send the 'old data' over the wire twice;
>> > > > the OSD can read it locally to put into the PG log). Of course, PG
>> > > > logs aren't synchronized across the pool, but that's easy [...]
>> > >
>> > > I don't think the pg log can be sanely repurposed for this. It is a
>> > > metadata journal only, and needs to be in order to make peering work
>> > > effectively, whereas the rbd journal needs to be a data journal to
>> > > work well. Also, if the updates are spread across all of the rbd
>> > > image blocks/objects, then it becomes impractical to stream them to
>> > > another cluster, because you'll need to watch for those updates on
>> > > all objects (vs just the journal objects)...
>> >
>> > I don't see the difference between the pg-log "metadata" journal and
>> > the rbd journal (when running in the 'non-old-data-preserving' mode).
>> > Essentially, the pg-log allows a local replica to "catch up"; how is
>> > that different from allowing a non-local rbd to "catch up"?
>>
>> The PG log only indicates which objects were touched and which versions are
>> (now) the latest. When recovery happens, we go get the latest version of
>> the object from the usual location. If there are two updates to the same
>> object, the log tells us that happened, but we don't preserve the intermediate
>> version.
>> The rbd data journal, on the other hand, would preserve the full
>> update timeline, ensuring that we have a fully coherent view of the image at
>> any point in the timeline.
>>
>> --
>>
>> In any case, this is the proposal we originally discussed at CDS. I'm not
>> sure if it's the best or most efficient, but I think it is relatively simple
>> to implement and takes advantage of the existing abstractions and
>> interfaces. Input is definitely welcome! I'm skeptical that the pg log
>> will be useful in this case, but you're right that the overhead with the
>> proposed approach is non-trivial...
>>
>> sage
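[Editor's note: the metadata-journal vs. data-journal distinction drawn above can be illustrated with two toy records of the same pair of updates to one object. The structures below are illustrative stand-ins, not actual Ceph formats: the PG-log-style record keeps only (object, version), so intermediate contents are gone, while the data-journal-style record keeps every payload and can be replayed to any point in the timeline.]

```python
# Two successive client writes to the same offset of an rbd object.
updates = [(0, b'v1'), (0, b'v2')]

# PG-log-style record: object name and version only. After the second
# write, the intermediate contents (b'v1') are unrecoverable; a
# recovering replica simply fetches the latest copy of 'obj.0'.
pg_log = [('obj.0', version) for version, _ in enumerate(updates, start=1)]
assert pg_log == [('obj.0', 1), ('obj.0', 2)]

# Data-journal-style record: offset *and* payload for every event, so a
# remote cluster can reconstruct the image at any point in the timeline.
data_journal = [(offset, data) for offset, data in updates]

def replay(journal, upto):
    """Rebuild the image state after the first `upto` journal events."""
    image = {}
    for offset, data in journal[:upto]:
        image[offset] = data
    return image

assert replay(data_journal, 1) == {0: b'v1'}  # intermediate state recoverable
assert replay(data_journal, 2) == {0: b'v2'}
```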
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html

--
Milosz Tanski
CTO
10 East 53rd Street, 37th floor
New York, NY 10022

p: 646-253-9055
e: [email protected]
