Thanks again, Rajesh.

----- Original Message -----
> From: "Rajesh Joseph" <rjos...@redhat.com> > To: "Paul Cuzner" <pcuz...@redhat.com> > Cc: "gluster-devel" <gluster-devel@nongnu.org> > Sent: Wednesday, 9 April, 2014 12:04:35 AM > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals > Hi Paul, > Whenever a brick comes online it performs a handshake with glusterd. The > brick will not send a notification to > clients until the handshake is done. We are planning to provide an extension > to this and recreate those missing snaps. > Best Regards, > Rajesh > ----- Original Message ----- > From: "Paul Cuzner" <pcuz...@redhat.com> > To: "Rajesh Joseph" <rjos...@redhat.com> > Cc: "gluster-devel" <gluster-devel@nongnu.org> > Sent: Tuesday, April 8, 2014 12:49:13 PM > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals > Rajesh, > Perfect explanation - the 'penny has dropped'. I was missing the healing > process of the snap being based on the snap from the replica. > One final question - I assume the scenario you mention about the brick coming > back online before the snapshots are taken is theoretical and there are > blocks in place to prevent this from happening? > BTW, I'll get the BZ RFE's in by the end of my week, and will post the BZ's > back to the list for info. > Thanks! > PC > ----- Original Message ----- > > From: "Rajesh Joseph" <rjos...@redhat.com> > > To: "Paul Cuzner" <pcuz...@redhat.com> > > Cc: "gluster-devel" <gluster-devel@nongnu.org> > > Sent: Tuesday, 8 April, 2014 5:09:10 PM > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals > > Hi Paul, > > It would be great if you can raise RFEs for both snap after restore and > > snapshot naming. > > Let's say your volume "Vol" has bricks b1, b2, b3 and b4. > > @0800 - S1 (snapshot volume) -> s1_b1, s1_b2, s1_b3, s1_b4 (These are > > respective snap bricks which are on independent thin LVs) > > @0830 - b1 went down > > @1000 - S2 (snapshot volume) -> s2_b1, x, s2_b3, s2_b4. Here we mark the > > brick has pending snapshot. > > Note that s2_b1 will have all the changes missed by b2 till 1000 hours. AFR > > will mark the > > pending changes on s2_b1. > > @1200 - S3 (Snapshot volume) -> s3_b1, x, s3_b3, s3_b4. This missed > > snapshot > > is also recorded. > > @1400 - S4 (Snapshot volume) -> s4_b1, x, s4_b3, s4_b4. This missed > > snapshot > > is also recorded. > > @1530 - b2 comes back. Before making it online we take snapshot s2_b2, > > s3_b2 > > and s4_b2. Since all > > these three snapshots are taken nearly at the same time content-wise all of > > them would be > > at the same state. Now these bricks are added to their respective volumes. > > Note that till > > now no healing is done. After addition snapshot volumes will look like > > this: > > S2 -> s2_b1, s2_b2, s2_b3, s2_b4. > > S3 -> s3_b1, s3_b2, s3_b3, s3_b4. > > S4 -> s4_b1, s4_b2, s4_b3, s4_b4. > > After this b2 will come online, i.e. clients can access this brick. Now S2, > > S3 and S4 is healed. > > s2_b2 will get healed from s2_b1, s3_b2 will be healed from s3_b1 and so on > > and so forth. > > This healing will take s2_b2 to the point when the snapshot is taken. > > If the bricks come online before taking these snapshots self heal will try > > to > > take the brick (b2) to point closer > > to the current time (@1530). Therefore it will not be consistent with the > > other replica-set. > > Please let me know if you have more questions or clarifications. 
> > ----- Original Message -----
> > From: "Paul Cuzner" <pcuz...@redhat.com>
> > To: "Rajesh Joseph" <rjos...@redhat.com>
> > Cc: "gluster-devel" <gluster-devel@nongnu.org>
> > Sent: Tuesday, April 8, 2014 8:01:57 AM
> > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> >
> > Thanks Rajesh.
> >
> > Let me know if I should raise any RFEs - snap after restore, snapshot
> > naming, etc.
> >
> > I'm still being thick about the snapshot process with missing bricks. What
> > I'm missing is the heal process between snaps - my assumption is that the
> > snap of a brick needs to be consistent with the other brick snaps within
> > the same replica set. Let's use a home-directory use case as an example -
> > typically, I'd expect to see home directories getting snapped at 0800,
> > 1000, 1200, 1400, 1600, 1800 and 2200 each day. So in that context, say we
> > have a dist-repl volume with 4 bricks, b1<->b2, b3<->b4:
> >
> > @ 0800 all bricks are available, snap (S1) succeeds with a snap volume
> >   being created from all bricks
> > --- files continue to be changed and added
> > @ 0830 b2 is unavailable (D0). Gluster tracks the pending updates on b1
> >   that need to be applied to b2
> > --- files continue to be changed and added
> > @ 1000 snap requested - 3 of 4 bricks available, snap taken (S2) on b1, b3
> >   and b4 - snap volume activated
> > --- files continue to change
> > @ 1200 a further snap performed - S3
> > --- files continue to change
> > @ 1400 snapshot S4 taken
> > --- files change
> > @ 1530 missing brick 2 comes back online (D1)
> >
> > Now between disruption D0 and D1 there have been several snaps. My
> > understanding is that each snap should provide a view of the filesystem
> > consistent at the time of the snapshot - correct?
> >
> > You mention
> >
> > + brick2 comes up. At this moment we take a snapshot before we allow new
> >   I/O or heal of the brick. If multiple snaps are missed then all the
> >   snaps are taken at this time. We don't wait till the brick is brought to
> >   the same state as the other bricks.
> > + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> >   volume). Self-heal will take care of bringing brick2's state up to that
> >   of its replica set.
> >
> > According to this description, if you snapshot b2 as soon as it's back
> > online - that generates S1, S2 and S3 as at 08:30 - and lets self-heal
> > bring b2 up to the current time (D1). However, doesn't this mean that S1,
> > S2 and S3 on brick2 are not equal to S2, S3, S4 on brick1?
> >
> > If that is right, then if b1 is unavailable the corresponding snapshots on
> > b2 wouldn't support the recovery points of 1000, 1200 and 1400 - which we
> > know are ok on b1.
> >
> > I guess I'd envisaged snapshots working hand-in-glove with self-heal to
> > maintain the snapshot consistency - and may just be stuck on that thought.
> > Maybe this is something I'll only get on a whiteboard - wouldn't be the
> > first time :(
> >
> > I appreciate your patience in explaining this recovery process!
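The answer, per Rajesh's timeline above, is that the snap brick taken when b2
returns is never healed from the live brick; it is healed from the matching
snap brick on the replica, using the pending changes AFR recorded at snapshot
time. A toy simulation of why that preserves the recovery point (purely
illustrative Python, not GlusterFS code):

import copy

# Model each brick as the set of changes it has applied: change id -> time written.
b1, b2 = {}, {}

def write(brick, change_id, when):
    brick[change_id] = when

def snapshot(brick):
    return copy.deepcopy(brick)  # a thin-LV snap is a point-in-time copy

write(b1, "f1", "0815"); write(b2, "f1", "0815")
# 0830: b2 goes down; further writes land only on b1.
write(b1, "f2", "0930")
s2_b1 = snapshot(b1)             # S2 taken at 1000; b2's snap is missed
write(b1, "f3", "1100")          # changes keep accumulating on the live volume

# 1530: b2 returns; its snap brick for S2 is created before any heal, so it is stale.
s2_b2 = snapshot(b2)

# Heal s2_b2 from s2_b1 (the replica's snap, frozen at 1000), not from the live b1:
for change, when in s2_b1.items():
    s2_b2.setdefault(change, when)

assert s2_b2 == {"f1": "0815", "f2": "0930"}  # exactly the 1000 recovery point
# Healing from the live brick b1 instead would drag f3 (written at 1100) into S2.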
> > ----- Original Message -----
> > > From: "Rajesh Joseph" <rjos...@redhat.com>
> > > To: "Paul Cuzner" <pcuz...@redhat.com>
> > > Cc: "gluster-devel" <gluster-devel@nongnu.org>
> > > Sent: Monday, 7 April, 2014 10:12:53 PM
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> > >
> > > Thanks Paul for your valuable comments. Please find my comments in-lined
> > > below. Please let us know if you have more questions or clarifications.
> > > I will try to update the doc wherever more clarity is needed.
> > >
> > > Thanks & Regards,
> > > Rajesh
> > >
> > > ----- Original Message -----
> > > From: "Paul Cuzner" <pcuz...@redhat.com>
> > > To: "Rajesh Joseph" <rjos...@redhat.com>
> > > Cc: "gluster-devel" <gluster-devel@nongnu.org>
> > > Sent: Monday, April 7, 2014 1:59:10 AM
> > > Subject: Re: [Gluster-devel] GlusterFS Snapshot internals
> > >
> > > Hi Rajesh,
> > >
> > > Thanks for updating the design doc. It reads well.
> > > I have a number of questions that would help my understanding:
> > >
> > > Logging : The doc doesn't mention how the snapshot process is logged -
> > > - will snapshot use an existing log or a new log?
> > > [RJ]: As of now snapshot makes use of the existing logging framework.
> > > - will the log be specific to a volume, or will all snapshot activity be
> > >   logged in a single file?
> > > [RJ]: The snapshot module is embedded in the gluster core framework.
> > >   Therefore the logs will also be part of the glusterd logs.
> > > - will the log be visible on all nodes, or just the originating node?
> > > [RJ]: Similar to glusterd, snapshot logs related to each node will be
> > >   visible on that node.
> > > - will the high-level snapshot action be visible when looking from the
> > >   other nodes, either in the logs or at the CLI?
> > > [RJ]: As of now the high-level snapshot action will be visible only in
> > >   the logs of the originator node, though the CLI can be used to see the
> > >   list and info of snapshots from any other node.
> > >
> > > Restore : You mention that after a restore operation, the snapshot will
> > > be automatically deleted.
> > > - I don't believe this is a prudent thing to do. Here's an example I've
> > >   seen a lot: an application has a programmatic error, leading to data
> > >   'corruption' - devs work on the program, storage guys roll the volume
> > >   back. So far so good... devs provide the updated program, and away you
> > >   go... BUT the issue is not resolved, so you need to roll back again to
> > >   the same point in time. If you delete the snap automatically, you lose
> > >   the restore point. Yes, the admin could take another snap after the
> > >   restore - but why add more work into a recovery process where people
> > >   are already stressed out :) I'd recommend leaving the snapshot if
> > >   possible, and letting it age out naturally.
> > > [RJ]: Snapshot restore is a simple operation wherein the volume bricks
> > >   will simply point to the brick snapshot instead of the original brick.
> > >   Therefore once the restore is done we cannot use the same snapshot
> > >   again. We are planning to implement a configurable option which will
> > >   automatically take a snapshot of the snapshot to fulfill the
> > >   above-mentioned requirement, but with the given timeline and resources
> > >   we will not be able to target it in the coming release.
> > >
> > > Auto-delete : Is this a post phase of the snapshot create, so the
> > > successful creation of a new snapshot will trigger the pruning of old
> > > versions?
> > > [RJ]: Yes, if we reach the snapshot limit for a volume then the snapshot
> > >   create operation will trigger pruning of older snapshots.
> > >
> > > Snapshot Naming : The doc states the name is mandatory.
> > > - why not offer a default - volume_name_timestamp - instead of making
> > >   the caller decide on a name? Having this as a default will also make
> > >   the list under .snap more usable by default.
> > > - providing a sensible default will make it easier for end users to do
> > >   self-service restore. More sensible defaults = more happy admins :)
> > > [RJ]: This is a good-to-have feature; we will try to incorporate it in
> > >   the next release.
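A volume_name_timestamp default is easy to picture. The helper below is a
hypothetical illustration of Paul's suggestion, not behaviour the snapshot CLI
actually implements:

from datetime import datetime

def default_snap_name(volume_name, now=None):
    """Hypothetical default snapshot name: <volume>_<YYYYMMDD-HHMMSS>.
    Listings under .snap would then sort chronologically by default."""
    now = now or datetime.now()
    return "%s_%s" % (volume_name, now.strftime("%Y%m%d-%H%M%S"))

print(default_snap_name("homedirs", datetime(2014, 4, 7, 8, 0, 0)))
# homedirs_20140407-080000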
> > > Quorum and snap restore : the doc mentions that when a returning brick
> > > comes back, it will be snap'd before pending changes are applied. If I
> > > understand the use of quorum correctly, can you comment on the following
> > > scenario:
> > > - with a brick offline, we'll be tracking changes. Say after 1 hr a snap
> > >   is invoked because quorum is met
> > > - changes continue on the volume for another 15 minutes beyond the snap,
> > >   when the offline brick comes back online.
> > > - at this point there are two points in time to bring the brick back to
> > >   - the brick needs the changes up to the point of the snap, then a snap
> > >   of the brick, followed by the 'replay' of the additional changes to
> > >   get back to the same point in time as the other replicas in the
> > >   replica set.
> > > - of course, the brick could be offline for 24 or 48 hours due to a
> > >   hardware fault - during which time multiple snapshots could have been
> > >   made
> > > - it wasn't clear to me from the doc how this scenario is dealt with.
> > > [RJ]: The following action is taken in case we miss a snapshot on a
> > >   brick:
> > >   + Let's say brick2 is down while taking snapshot s1.
> > >   + Snapshot s1 will be taken for all the bricks except brick2. We will
> > >     update the bookkeeping about the missed activity.
> > >   + I/O can continue to happen on the origin volume.
> > >   + brick2 comes up. At this moment we take a snapshot before we allow
> > >     new I/O or heal of the brick. If multiple snaps are missed then all
> > >     the snaps are taken at this time. We don't wait till the brick is
> > >     brought to the same state as the other bricks.
> > >   + brick2_s1 (snap of brick2) will be added to the s1 volume (snapshot
> > >     volume). Self-heal will take care of bringing brick2's state up to
> > >     that of its replica set.
> > >
> > > barrier : two things are mentioned here - a buffer size and a timeout
> > > value.
> > > - from an admin's perspective, being able to specify the timeout (secs)
> > >   is likely to be more workable - and will allow them to align this
> > >   setting with any potential timeout setting within the application
> > >   running against the gluster volume. I don't think most admins will
> > >   know, or want to know, how to size the buffer properly.
> > > [RJ]: In the current release we are only providing the timeout value as
> > >   a configurable option. The buffer size is being considered as a
> > >   configurable option for a future release, or we will determine the
> > >   optimal value ourselves based on the user's system configuration.
> > >
> > > Hopefully the above makes sense.
> > >
> > > Cheers,
> > >
> > > Paul C
> > >
> > > ----- Original Message -----
> > > > From: "Rajesh Joseph" <rjos...@redhat.com>
> > > > To: "gluster-devel" <gluster-devel@nongnu.org>
> > > > Sent: Wednesday, 2 April, 2014 3:55:28 AM
> > > > Subject: [Gluster-devel] GlusterFS Snapshot internals
> > > >
> > > > Hi all,
> > > >
> > > > I have updated the GlusterFS snapshot forge wiki:
> > > > https://forge.gluster.org/snapshot/pages/Home
> > > >
> > > > Please go through it and let me know if you have any questions or
> > > > queries.
> > > >
> > > > Best Regards,
> > > > Rajesh
> > > >
> > > > [PS]: Please ignore previous mail.
> > > > Accidentally hit send before completing :)
> > > >
> > > > _______________________________________________
> > > > Gluster-devel mailing list
> > > > Gluster-devel@nongnu.org
> > > > https://lists.nongnu.org/mailman/listinfo/gluster-devel
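Coming back to the barrier discussion earlier in the thread: the idea is that
while a snapshot is being taken, incoming operations are held and then released
either when the barrier is lifted or when the configurable timeout expires, so
clients never block indefinitely. The sketch below only illustrates that
behaviour in Python; it is not the barrier translator's actual code, and the
class and option names are made up.

import queue
import threading

class SnapshotBarrier:
    """Hold incoming operations while enabled; release them when the barrier
    is lifted or when the configured timeout (in seconds) expires."""

    def __init__(self, timeout_secs=120):       # the admin-tunable timeout
        self.timeout_secs = timeout_secs
        self.held = queue.Queue()
        self.enabled = False
        self.lock = threading.Lock()

    def enable(self):
        with self.lock:
            self.enabled = True
        # Safety valve: auto-release if the snapshot takes longer than the timeout.
        threading.Timer(self.timeout_secs, self.disable).start()

    def disable(self):
        with self.lock:
            if not self.enabled:
                return
            self.enabled = False
        while not self.held.empty():
            self.held.get()()                    # replay each held operation

    def submit(self, op):
        with self.lock:
            if self.enabled:
                self.held.put(op)                # queue the op until the barrier lifts
                return
        op()

barrier = SnapshotBarrier(timeout_secs=2)
barrier.enable()
barrier.submit(lambda: print("write released after the barrier lifts"))
barrier.disable()                                # snapshot finished; release held ops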
_______________________________________________
Gluster-devel mailing list
Gluster-devel@nongnu.org
https://lists.nongnu.org/mailman/listinfo/gluster-devel