Sean: Do you have more comments? Cheers
On Fri, Sep 9, 2016 at 1:42 PM, Vladimir Rodionov <vladrodio...@gmail.com> wrote:

> Sean,
>
> Backup/Restore can fail due to various reasons: network outage (cluster
> wide), various time-outs in HBase and HDFS layer, M/R failure due to "HDFS
> exceeded quota", user error (manual deletion of data) and so on. It
> is impossible to enumerate all possible types of failures in a distributed
> system - that is not our goal/task.
>
> We focus completely on backup system table consistency in the presence of any
> type of failure. That is what I call "tolerance to failures".
>
> On a failure:
>
> BACKUP. All backup system information (prior to backup) will be restored
> and all temporary data, related to a failed session, in HDFS will be
> deleted.
> RESTORE. We do not care about system data, because restore does not change
> it. Temporary data in HDFS will be cleaned up and table will be in a state
> back to where it was before operation started.
>
> This is what user should expect in case of a failure.
>
> -Vlad
>
> On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <bus...@apache.org> wrote:
>
> > Failing in a consistent way, with docs that explain the various
> > expected failures would be sufficient.
> >
> > On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> > <vladrodio...@gmail.com> wrote:
> > > Do not worry Sean, doc is coming today as a preview and our writer Frank
> > > will be working on putting it into Apache repo. Timeline depends on
> > > Frank's schedule but I hope we will get it rather sooner than later.
> > >
> > > As for failure testing, we are focusing only on a consistent state of
> > > backup system data in the presence of any type of failures. We are not going
> > > to implement anything more "fancy" than that. We allow both backup and
> > > restore to fail. What we do not allow is to have system data corrupted.
> > > Will it suffice for you? Do you have any other concerns you want us to
> > > address?
> > >
> > > -Vlad
> > >
> > >
> > > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <bus...@apache.org> wrote:
> > >
> > >> "docs will come to Apache soon" does not address my concern around docs at
> > >> all, unless said docs have already made it into the project repo. I don't
> > >> want third party resources for using a major and important feature of the
> > >> project, I want us to provide end users with what they need to get the job
> > >> done.
> > >>
> > >> I see some calls for patience on the failure testing, but the appeal to us
> > >> having done a bad job of requiring proper tests of previous features just
> > >> makes me more concerned about not getting them here. I don't want to set
> > >> yet another bad example that will then be pointed to in the future.
> > >>
> > >> On Sep 8, 2016 10:50, "Ted Yu" <yuzhih...@gmail.com> wrote:
> > >>
> > >> > Is there any concern which is not addressed?
> > >> >
> > >> > Do we need another Vote thread?
> > >> >
> > >> > Thanks
> > >> >
> > >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <apurt...@apache.org>
> > >> > wrote:
> > >> >
> > >> > > Vlad,
> > >> > >
> > >> > > I apologize for using the term 'half-baked' in a way that could seem a
> > >> > > description of HBASE-7912. I meant that as a general hypothetical.
> > >> > >
> > >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <vladrodio...@gmail.com>
> > >> > > wrote:
> > >> > >
> > >> > > > >> I'm not sure that "There is already lots of half-baked code in the
> > >> > > > >> branch, so what's the harm in adding more?"
> > >> > > >
> > >> > > > I meant - not production-ready yet. This is 2.0 development branch and,
> > >> > > > hence many features are in works,
> > >> > > > not being tested well etc.
> > >> > > > I do not consider backup as half baked feature -
> > >> > > > it has passed our internal QA and has very good doc, which we will provide
> > >> > > > to Apache shortly.
> > >> > > >
> > >> > > > -Vlad
> > >> > > >
> > >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <apurt...@apache.org>
> > >> > > > wrote:
> > >> > > >
> > >> > > > > We shouldn't admit half baked changes that won't be finished. However in
> > >> > > > > this case the crew working on this feature are long timers and less likely
> > >> > > > > than just about anyone to leave something in a half baked state. Of course
> > >> > > > > there is no guarantee how anything will turn out, but I am willing to take
> > >> > > > > a little on faith if they feel their best path forward now is to merge to
> > >> > > > > trunk. I only wish I had bandwidth to have done some real kicking of the
> > >> > > > > tires by now. Maybe this week.
> > >> > > > >
> > >> > > > > (Yes, I'm using some of that time for this email :-) but I type fast.)
> > >> > > > >
> > >> > > > > That said, I would like to agitate for making 2.0 more real and spend some
> > >> > > > > time on it now that I'm winding down with 0.98. I think that means
> > >> > > > > branching for 2.0 real soon now and even evicting things from 2.0 branch
> > >> > > > > that aren't finished or stable, leaving them only once again in the master
> > >> > > > > branch. Or, maybe just evicting them. Let's take it case by case.
> > >> > > > >
> > >> > > > > I think this feature can come in relatively safely.
> > >> > > > > As added insurance,
> > >> > > > > let's admit the possibility it could be reverted on the 2.0 branch if folks
> > >> > > > > working on stabilizing 2.0 decide to evict it because it is unfinished or
> > >> > > > > unstable, because that certainly can happen. I would expect if talk like
> > >> > > > > that starts, we'd get help finishing or stabilizing what's under discussion
> > >> > > > > for revert. Or, we'd have a revert. Either way the outcome is acceptable.
> > >> > > > >
> > >> > > > >
> > >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <dimaspi...@apache.org>
> > >> > > > > wrote:
> > >> > > > >
> > >> > > > > > I'm not sure that "There is already lots of half-baked code in the branch,
> > >> > > > > > so what's the harm in adding more?" is a good code commit philosophy for a
> > >> > > > > > fault-tolerant distributed data store. ;)
> > >> > > > > >
> > >> > > > > > More seriously, a lack of test coverage for existing features shouldn't be
> > >> > > > > > used as justification for introducing new features with the same
> > >> > > > > > shortcomings. Ultimately, it's the end user who will feel the pain, so
> > >> > > > > > shouldn't we do everything we can to mitigate that?
> > >> > > > > >
> > >> > > > > > -Dima
> > >> > > > > >
> > >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <vladrodio...@gmail.com>
> > >> > > > > > wrote:
> > >> > > > > >
> > >> > > > > > > Sean,
> > >> > > > > > >
> > >> > > > > > > * have docs
> > >> > > > > > >
> > >> > > > > > > Agree. We have a doc and backup is the most documented feature :), we will
> > >> > > > > > > release it shortly to Apache.
> > >> > > > > > >
> > >> > > > > > > * have sunny-day correctness tests
> > >> > > > > > >
> > >> > > > > > > Feature has close to 60 test cases, which run for approx 30 min. We can
> > >> > > > > > > add more, if community does not mind :)
> > >> > > > > > >
> > >> > > > > > > * have correctness-in-face-of-failure tests
> > >> > > > > > >
> > >> > > > > > > Any examples of these tests in existing features? In works, we have a clear
> > >> > > > > > > understanding of what should be done by the time of 2.0 release.
> > >> > > > > > > That is very close goal for us, to verify IT monkey for existing code.
> > >> > > > > > >
> > >> > > > > > > * don't rely on things outside of HBase for normal operation (okay for
> > >> > > > > > >   advanced operation)
> > >> > > > > > >
> > >> > > > > > > We do not.
> > >> > > > > > >
> > >> > > > > > > Enormous time has been spent already on the development and testing of the
> > >> > > > > > > feature, it has passed our internal tests and many rounds of code reviews
> > >> > > > > > > by HBase committers. We do not mind if someone from HBase community
> > >> > > > > > > (outside of HW) will review the code, but it will probably take forever to
> > >> > > > > > > wait for a volunteer; the feature is quite large (1MB+ cumulative patch).
> > >> > > > > > >
> > >> > > > > > > 2.0 branch is full of half baked features, most of them are in active
> > >> > > > > > > development, therefore I am not following you here, Sean. Why is HBASE-7912
> > >> > > > > > > not good enough yet to be integrated into 2.0 branch?
> > >> > > > > > >
> > >> > > > > > > -Vlad
> > >> > > > > > >
> > >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <bus...@apache.org>
> > >> > > > > > > wrote:
> > >> > > > > > >
> > >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <josh.el...@gmail.com> wrote:
> > >> > > > > > > > > So, the answer to Sean's original question is "as robust as snapshots
> > >> > > > > > > > > presently are"? (independence of backup/restore failure tolerance from
> > >> > > > > > > > > snapshot failure tolerance)
> > >> > > > > > > > >
> > >> > > > > > > > > Is this just a question WRT context of the change, or is it means for a veto
> > >> > > > > > > > > from you, Sean? Just trying to make sure I'm following along adequately.
> > >> > > > > > > > >
> > >> > > > > > > >
> > >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons I can articulate
> > >> > > > > > > > well.
> > >> > > > > > > >
> > >> > > > > > > > Here's an attempt.
> > >> > > > > > > >
> > >> > > > > > > > We've been trying to move, as a community, towards minimizing risk to
> > >> > > > > > > > downstream folks by getting "complete enough for use" gates in place
> > >> > > > > > > > before we introduce new features. This was spurred by some features
> > >> > > > > > > > getting in half-baked and never making it to "can really use" status
> > >> > > > > > > > (I'm thinking of distributed log replay and the zk-less assignment
> > >> > > > > > > > stuff, I don't recall if there was more).
> > >> > > > > > > >
> > >> > > > > > > > The gates, generally, included things like:
> > >> > > > > > > >
> > >> > > > > > > > * have docs
> > >> > > > > > > > * have sunny-day correctness tests
> > >> > > > > > > > * have correctness-in-face-of-failure tests
> > >> > > > > > > > * don't rely on things outside of HBase for normal operation (okay for
> > >> > > > > > > >   advanced operation)
> > >> > > > > > > >
> > >> > > > > > > > As an example, we kept the MOB work off in a branch and out of master
> > >> > > > > > > > until it could pass these criteria. The big exemption we've had to
> > >> > > > > > > > this was the hbase-spark integration, where we all agreed it could
> > >> > > > > > > > land in master because it was very well isolated (the slide away from
> > >> > > > > > > > including docs as a first-class part of building up that integration
> > >> > > > > > > > has led me to doubt the wisdom of this decision).
> > >> > > > > > > >
> > >> > > > > > > > We've also been treating inclusion in a "probably will be released to
> > >> > > > > > > > downstream" branches as a higher bar, requiring
> > >> > > > > > > >
> > >> > > > > > > > * don't moderately impact performance when the feature isn't in use
> > >> > > > > > > > * don't severely impact performance when the feature is in use
> > >> > > > > > > > * either default-to-on or show enough demand to believe a non-trivial
> > >> > > > > > > >   number of folks will turn the feature on
> > >> > > > > > > >
> > >> > > > > > > > The above has kept MOB and hbase-spark integration out of branch-1,
> > >> > > > > > > > presumably while they've "gotten more stable" in master from the odd
> > >> > > > > > > > vendor inclusion.
> > >> > > > > > > >
> > >> > > > > > > > Are we going to have a 2.0 release before the end of the year? We're
> > >> > > > > > > > coming up on 1.5 years since the release of version 1.0; seems like
> > >> > > > > > > > it's about time, though I haven't seen any concrete plans this year.
> > >> > > > > > > > Presuming we are going to have one by the end of the year, it seems a
> > >> > > > > > > > bit close to still be adding in "features that need maturing" on the
> > >> > > > > > > > branch.
> > >> > > > > > > >
> > >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from considering these
> > >> > > > > > > > things blocker at the moment. But I know first hand how much trouble
> > >> > > > > > > > folks have had with other features that have gone into downstream
> > >> > > > > > > > facing releases without robustness checks (i.e. replication), and I'm
> > >> > > > > > > > concerned about what we're setting up if 2.0 goes out with this
> > >> > > > > > > > feature in its current state.
> > >> > > > > > > >
> > >> > > > > --
> > >> > > > > Best regards,
> > >> > > > >
> > >> > > > >    - Andy
> > >> > > > >
> > >> > > > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> > > > > (via Tom White)
> > >> > > > >
> > >> > > --
> > >> > > Best regards,
> > >> > >
> > >> > >    - Andy
> > >> > >
> > >> > > Problems worthy of attack prove their worth by hitting back. - Piet Hein
> > >> > > (via Tom White)