I created a set of tickets for what I think needs to happen: http://tracker.ceph.com/issues/14031. There's a doc linked from that ticket with more details on the design. -Sam

On Fri, Nov 13, 2015 at 10:26 AM, Tianshan Qu <qutians...@gmail.com> wrote:
> Hi:
>
> Sorry for the half-sent message.
>
> I've been working on this recently. I think we should distinguish
> between the sequential-write case and the small-write case.
>
> For the first, overwrite is enough. For the second, the typical
> optimization for small writes is a log-based write, which does not
> need EC reconstruction and avoids the extra reads of the parity
> chunks.
>
> The FAST '14 paper (Parity Logging with Reserved Space: Towards
> Efficient Updates and Recovery in Erasure-coded Clustered Storage)
> proposes a way to further improve performance: it pre-allocates
> reserved space after each parity chunk to avoid fragmentation.
>
> I think the challenge in implementing this in Ceph is how to manage
> the reserved space, since we may have no direct support for it.
>
> Also, the kv store is another good choice for the log.
>
> For the EC interface, I think we need an extra function to apply
> parity changes.
>
> 2015-11-14 0:16 GMT+08:00 Allen Samuels <allen.samu...@sandisk.com>:
>> This scheme fundamentally relies on the temporary objects
>> "gracefully" transitioning into being portions of full-up long-term
>> durable objects.
>>
>> This means that if the allocation size for a temporary object
>> significantly mismatches the size of the mutation (partial stripe
>> write), you're creating a problem that's proportional to the
>> mismatch size.
>>
>> So either NewStore is able to efficiently allocate small chunks, OR
>> you have some kind of background cleanup process that reclaims that
>> space (i.e., a "straightener").
>>
>> The right choice depends on having a shared notion of the
>> operational profile that you're trying to optimize for.
>>
>> The fundamental question becomes: are you going to optimize for
>> small-block random writes?
>>
>> In my experience this is a key use-case in virtually every
>> customer's evaluation scenario. I believe we MUST make this case
>> reasonably efficient.
>>
>> It seems to me that the lowest-complexity "fix" for the problem is
>> to teach NewStore to have two different allocation sizes (big and
>> small :)). Naturally the allocator becomes more complex. Worst
>> case, you're now left with the garbage collection problem, which I
>> suspect could be punted to a subsequent release (i.e., I'm out of
>> large blocks, but there's plenty of fragmented available space --
>> this can happen, but it's a pretty pathological case which becomes
>> rarer and rarer as you scale out).
>>
>> Allen Samuels
>> Software Architect, Emerging Storage Solutions
>>
>> 2880 Junction Avenue, Milpitas, CA 95134
>> T: +1 408 801 7030 | M: +1 408 780 6416
>> allen.samu...@sandisk.com
>>
>>
>> -----Original Message-----
>> From: Samuel Just [mailto:sj...@redhat.com]
>> Sent: Friday, November 13, 2015 7:39 AM
>> To: Sage Weil <sw...@redhat.com>
>> Cc: ceph-devel@vger.kernel.org; Allen Samuels <allen.samu...@sandisk.com>; Durgin, Josh <jdur...@redhat.com>; Farnum, Gregory <gfar...@redhat.com>
>> Subject: Re: Notes from a discussion of a design to allow EC overwrites
>>
>> Lazily persisting the intermediate entries would certainly also
>> work, but there's an argument that it needlessly adds to the write
>> transaction.
>>
>> Actually, we probably want to avoid having small writes be full
>> stripe writes -- with an 8+3 code, the difference between modifying
>> a single stripelet and modifying the full stripe is 4 writes vs 11
>> writes.
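>>
>> To make the log-based small-write idea above concrete: for a linear
>> code, the parity update for a small overwrite can be applied as a
>> delta, with no stripe reconstruction. A minimal sketch (shown for a
>> plain XOR parity shard; a Reed-Solomon shard would additionally
>> scale the delta by that shard's GF(2^8) coefficient -- the function
>> name is mine, not part of the plugin interface):
>>
>>     #include <cstddef>
>>     #include <cstdint>
>>
>>     // new_parity = old_parity ^ old_data ^ new_data: a small write
>>     // needs only the old contents of the one modified data chunk
>>     // plus the parity chunks -- no full-stripe read or reconstruct.
>>     void apply_parity_delta(uint8_t *parity, const uint8_t *old_data,
>>                             const uint8_t *new_data, size_t len) {
>>       for (size_t i = 0; i < len; ++i)
>>         parity[i] ^= old_data[i] ^ new_data[i];
>>     }
>>
>> This is also why a log of (offset, delta) records can stand in for
>> the parity-chunk update until it is applied.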
>>
>> It means that during peering, any log we can find (particularly the
>> shortest one) from the most recent active interval isn't an upper
>> bound on writes committed to the client (the (actingset.size() - M
>> - 1)th one is?) -- we'd have to think carefully about the
>> implications of that.
>> -Sam
>>
>> On Fri, Nov 13, 2015 at 5:35 AM, Sage Weil <sw...@redhat.com> wrote:
>>> On Thu, 12 Nov 2015, Samuel Just wrote:
>>>> I was present for a discussion about allowing EC overwrites and
>>>> thought it would be good to summarize it for the list:
>>>>
>>>> Commit Protocol:
>>>> 1) client sends write to primary
>>>> 2) primary reads in the partial stripes needed for partial-stripe
>>>> overwrites from replicas
>>>> 3) primary sends prepares to participating replicas and queues
>>>> its own prepare locally
>>>> 4) once all prepares are complete, primary sends a commit to the
>>>> client
>>>> 5) primary sends applies to all participating replicas
>>>>
>>>> When we get the prepare, we write out a temp object with the data
>>>> to be written. On apply, we use an objectstore primitive to
>>>> atomically move those extents into the actual object. The log
>>>> entry contains the name/id of the temp object so it can be
>>>> applied on apply or removed on rollback.
>>>
>>> Currently we assume that temp objects are/can be cleared out on
>>> restart. This will need to change. And we'll need to be careful
>>> that they get cleaned out when peering completes (and the
>>> rollforward/rollback decision is made).
>>>
>>> If the stripes are small, then the objectstore primitive may not
>>> actually be that efficient. I'd suggest also hinting that the temp
>>> object will be swapped later, so that the backend can, if it's
>>> small, store it in a cheap temporary location in the expectation
>>> that it will get rewritten later. (In particular, the newstore
>>> allocation chunk is currently targeting 512KB, and this will only
>>> be efficient with narrow stripes, so it'll just get
>>> double-written. We'll want to keep the temp value in the kv store
>>> [log, hopefully] and not bother to allocate disk and rewrite it.)
>>>
>>>> Each log entry contains a list of the shard ids modified. During
>>>> peering, we use the same protocol for choosing the authoritative
>>>> log as for the existing EC pool, except that we first take the
>>>> longest candidate log and use it to extend shorter logs until
>>>> they hit an entry they should have witnessed, but didn't.
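>>>>
>>>> A rough sketch of that extension rule (types and names here are
>>>> illustrative, not the actual PGLog code):
>>>>
>>>>     #include <cstdint>
>>>>     #include <set>
>>>>     #include <vector>
>>>>
>>>>     struct LogEntry {
>>>>       uint64_t version;
>>>>       std::set<int> shards_modified;  // shard ids this write touched
>>>>     };
>>>>
>>>>     // Extend a shard's log from the longest candidate log:
>>>>     // append entries the shard didn't participate in, and stop
>>>>     // at the first entry it should have witnessed but didn't.
>>>>     void extend_shard_log(std::vector<LogEntry> &shard_log, int shard,
>>>>                           const std::vector<LogEntry> &longest) {
>>>>       uint64_t head = shard_log.empty() ? 0 : shard_log.back().version;
>>>>       for (const auto &e : longest) {
>>>>         if (e.version <= head)
>>>>           continue;                // already have it
>>>>         if (e.shards_modified.count(shard))
>>>>           break;                   // should have witnessed this one
>>>>         shard_log.push_back(e);    // non-participant; safe to extend
>>>>       }
>>>>     }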
>>>>
>>>> Implicit in the above scheme is the fact that if an object is
>>>> written but a particular shard isn't changed, the OSD with that
>>>> shard will have a copy of the object with the correct data but an
>>>> out-of-date object_info (notably, the version will be old). To
>>>> avoid this messing up the missing set during the log scan on OSD
>>>> start, we'll skip log entries we wouldn't have participated in
>>>> (we may not even choose to store them, see below). This does
>>>> generally pose a challenge for maintaining prior_version. It
>>>> seems like it shouldn't be much of a problem, since rollbacks can
>>>> only happen on prepared log entries which haven't been applied,
>>>> so log merging can never result in a divergent entry causing a
>>>> missing object. I think we can get by without it then?
>>>>
>>>> We can go further with the above and actually never persist a log
>>>> entry on a shard which it did not participate in. As long as
>>>> peering works correctly, the union of the logs we got must have
>>>> all entries. The primary will need to keep a complete copy in
>>>> memory to manage dup op detection and recovery, however.
>>>
>>> That sounds more complex to me. Maybe instead we could lazily
>>> persist the entries (on the next pg write) so that it is always a
>>> contiguous sequence?
>>>
>>>> 2) above can also be done much more efficiently. Some codes will
>>>> allow the parity chunks to be reconstructed more efficiently, so
>>>> some thought will have to go into restructuring the plugin
>>>> interface to allow more efficient updates.
>>>
>>> Hopefully the stripe size is chosen such that most writes will end
>>> up being full stripe writes (we should figure out whether the EC
>>> performance degrades significantly in that case?).
>>>
>>> An alternative would be to do something like:
>>>
>>> 1) client sends write to primary
>>> 2) primary sends prepare to the first M+1 shards, who write it in
>>> a temporary object/location
>>> 3) primary acks the write once they ack
>>> 4) asynchronously, primary recalculates the affected stripes and
>>> sends an overwrite
>>>
>>> - step 4 doesn't need to be 2-phase, since we have the original
>>> data persisted already on >M shards
>>> - the client-observed latency is bounded by only M+1 OSDs (not
>>> acting.size())
>>>
>>> I suspect you discussed this option, though, and have other
>>> concerns around its complexity?
>>>
>>> sage
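>>>
>>> A minimal sketch of that alternative flow (every name here is
>>> invented for illustration -- the message stubs just print what
>>> would be sent over the wire):
>>>
>>>     #include <cstdio>
>>>
>>>     constexpr int M = 3;  // parity shards; pool tolerates M failures
>>>
>>>     void send_prepare(int shard) { std::printf("prepare -> shard %d\n", shard); }
>>>     void wait_for_acks(int n)    { std::printf("wait for %d acks\n", n); }
>>>     void ack_client()            { std::printf("commit -> client\n"); }
>>>     void queue_stripe_rmw()      { std::printf("async stripe read-modify-write\n"); }
>>>
>>>     void handle_client_write() {
>>>       for (int shard = 0; shard <= M; ++shard)
>>>         send_prepare(shard);   // 2) persist payload on first M+1 shards
>>>       wait_for_acks(M + 1);    //    payload now survives any M failures
>>>       ack_client();            // 3) latency bounded by M+1 OSDs
>>>       queue_stripe_rmw();      // 4) async overwrite; no 2-phase needed
>>>     }
>>>
>>>     int main() { handle_client_write(); }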