Re: [DISCUSSION] Merge Backup / Restore - Branch HBASE-7912

Vladimir Rodionov Fri, 09 Sep 2016 13:43:23 -0700

Sean,

Backup/Restore can fail for many reasons: a cluster-wide network outage,
various time-outs in the HBase and HDFS layers, M/R failure due to "HDFS
exceeded quota", user error (manual deletion of data), and so on. It is
impossible to enumerate all possible types of failure in a distributed
system, and that is not our goal.


We focus entirely on keeping the backup system table consistent in the
presence of any type of failure. That is what I call "tolerance to failures".

On a failure:

BACKUP. All backup system information will be restored to its state prior
to the backup, and all temporary data in HDFS related to the failed session
will be deleted.

RESTORE. We do not care about system data, because restore does not change
it. Temporary data in HDFS will be cleaned up and the table will be back in
the state it was in before the operation started.

This is what a user should expect in case of a failure.
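
To make the backup case concrete, here is a minimal Python sketch of that
contract. It is an illustration only, not the HBASE-7912 code: `run_backup`,
`system_table`, `staging_root` and `work` are hypothetical names, a plain dict
stands in for the backup system table, and a local directory stands in for
HDFS.

```python
import copy
import shutil
import tempfile
from pathlib import Path


def run_backup(system_table, staging_root, work):
    """Run one backup session with rollback on failure.

    Hypothetical model: `system_table` is a dict standing in for the
    backup system table, `staging_root` is a local directory standing
    in for HDFS, and `work(system_table, tmp_dir)` is the backup itself.
    """
    # Remember the system data exactly as it was before the session.
    saved = copy.deepcopy(system_table)
    # All temporary data for this session goes into its own directory.
    tmp_dir = Path(tempfile.mkdtemp(prefix="backup-session-", dir=staging_root))
    try:
        work(system_table, tmp_dir)
    except Exception:
        # Failure: restore the pre-backup system data ...
        system_table.clear()
        system_table.update(saved)
        # ... delete the temporary data of the failed session ...
        shutil.rmtree(tmp_dir, ignore_errors=True)
        # ... and still report the failure to the caller.
        raise
    # Success: what happens to tmp_dir next is outside this sketch.
    return tmp_dir
```

The point of the sketch is that the session is allowed to fail, and the
caller sees the exception, but the system data and the staging area end up
as they were before the session started.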

-Vlad



On Fri, Sep 9, 2016 at 12:56 PM, Sean Busbey <[email protected]> wrote:

> Failing in a consistent way, with docs that explain the various
> expected failures would be sufficient.
>
> On Fri, Sep 9, 2016 at 12:16 PM, Vladimir Rodionov
> <[email protected]> wrote:
> > Do not worry Sean, doc is coming today as a preview and our writer Frank
> > will be working on putting it into the Apache repo. Timeline depends on
> > Frank's schedule but I hope we will get it sooner rather than later.
> >
> > As for failure testing, we are focusing only on a consistent state of
> > backup system data in the presence of any type of failure. We are not
> > going to implement anything more "fancy" than that. We allow both backup
> > and restore to fail. What we do not allow is to have system data corrupted.
> > Will it suffice for you? Do you have any other concerns you want us to
> > address?
> >
> > -Vlad
> >
> >
> > On Fri, Sep 9, 2016 at 10:56 AM, Sean Busbey <[email protected]> wrote:
> >
> >> "docs will come to Apache soon" does not address my concern around docs
> >> at all, unless said docs have already made it into the project repo. I
> >> don't want third party resources for using a major and important feature
> >> of the project, I want us to provide end users with what they need to get
> >> the job done.
> >>
> >> I see some calls for patience on the failure testing, but the appeal to
> >> us having done a bad job of requiring proper tests of previous features
> >> just makes me more concerned about not getting them here. I don't want to
> >> set yet another bad example that will then be pointed to in the future.
> >>
> >> On Sep 8, 2016 10:50, "Ted Yu" <[email protected]> wrote:
> >>
> >> > Is there any concern which is not addressed ?
> >> >
> >> > Do we need another Vote thread ?
> >> >
> >> > Thanks
> >> >
> >> > On Thu, Sep 8, 2016 at 9:21 AM, Andrew Purtell <[email protected]>
> >> > wrote:
> >> >
> >> > > Vlad,
> >> > >
> >> > > I apologize for using the term 'half-baked' in a way that could
> >> > > seem a description of HBASE-7912. I meant that as a general
> >> > > hypothetical.
> >> > >
> >> > > On Wed, Sep 7, 2016 at 9:36 AM, Vladimir Rodionov <
> >> > [email protected]>
> >> > > wrote:
> >> > >
> >> > > > >> I'm not sure that "There is already lots of half-baked code in
> the
> >> > > > branch,
> >> > > > so what's the harm in adding more?"
> >> > > >
> >> > > > I meant - not production - ready yet. This is 2.0 development
> branch
> >> > and,
> >> > > > hence many features are in works,
> >> > > > not being tested well etc. I do not consider backup as half baked
> >> > > feature -
> >> > > > it has passed our internal QA and has very good doc, which we will
> >> > > provide
> >> > > > to Apache shortly.
> >> > > >
> >> > > > -Vlad
> >> > > >
> >> > > > On Wed, Sep 7, 2016 at 9:13 AM, Andrew Purtell <
> [email protected]>
> >> > > > wrote:
> >> > > >
> >> > > > > We shouldn't admit half baked changes that won't be finished.
> >> However
> >> > > in
> >> > > > > this case the crew working on this feature are long timers and
> less
> >> > > > likely
> >> > > > > than just about anyone to leave something in a half baked
> state. Of
> >> > > > course
> >> > > > > there is no guarantee how anything will turn out, but I am
> willing
> >> to
> >> > > > take
> >> > > > > a little on faith if they feel their best path forward now is to
> >> > merge
> >> > > to
> >> > > > > trunk. I only wish I had bandwidth to have done some real
> kicking
> >> of
> >> > > the
> >> > > > > tires by now. Maybe this week.
> >> > > > >
> >> > > > > (Yes, I'm using some of that time for this email :-) but I type
> >> > fast.)
> >> > > > >
> >> > > > > That said, I would like to agitate for making 2.0 more real and
> >> spend
> >> > > > some
> >> > > > > time on it now that I'm winding down with 0.98. I think that
> means
> >> > > > > branching for 2.0 real soon now and even evicting things from
> 2.0
> >> > > branch
> >> > > > > that aren't finished or stable, leaving them only once again in
> the
> >> > > > master
> >> > > > > branch. Or, maybe just evicting them. Let's take it case by
> case.
> >> > > > >
> >> > > > > I think this feature can come in relatively safely. As added
> >> > insurance,
> >> > > > > let's admit the possibility it could be reverted on the 2.0
> branch
> >> if
> >> > > > folks
> >> > > > > working on stabilizing 2.0 decide to evict it because it is
> >> > unfinished
> >> > > or
> >> > > > > unstable, because that certainly can happen. I would expect if
> talk
> >> > > like
> >> > > > > that starts, we'd get help finishing or stabilizing what's under
> >> > > > discussion
> >> > > > > for revert. Or, we'd have a revert. Either way the outcome is
> >> > > acceptable.
> >> > > > >
> >> > > > >
> >> > > > > On Wed, Sep 7, 2016 at 8:56 AM, Dima Spivak <
> [email protected]
> >> >
> >> > > > wrote:
> >> > > > >
> >> > > > > > I'm not sure that "There is already lots of half-baked code in
> >> the
> >> > > > > branch,
> >> > > > > > so what's the harm in adding more?" is a good code commit
> >> > philosophy
> >> > > > for
> >> > > > > a
> >> > > > > > fault-tolerant distributed data store. ;)
> >> > > > > >
> >> > > > > > More seriously, a lack of test coverage for existing features
> >> > > shouldn't
> >> > > > > be
> >> > > > > > used as justification for introducing new features with the
> same
> >> > > > > > shortcomings. Ultimately, it's the end user who will feel the
> >> pain,
> >> > > so
> >> > > > > > shouldn't we do everything we can to mitigate that?
> >> > > > > >
> >> > > > > > -Dima
> >> > > > > >
> >> > > > > > On Wed, Sep 7, 2016 at 8:46 AM, Vladimir Rodionov <
> >> > > > > [email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Sean,
> >> > > > > > >
> >> > > > > > > * have docs
> >> > > > > > >
> >> > > > > > > Agree. We have a doc and backup is the most documented
> feature
> >> > :),
> >> > > we
> >> > > > > > will
> >> > > > > > > release it shortly to Apache.
> >> > > > > > >
> >> > > > > > > * have sunny-day correctness tests
> >> > > > > > >
> >> > > > > > > Feature has  close to 60 test cases, which run for approx 30
> >> min.
> >> > > We
> >> > > > > can
> >> > > > > > > add more, if community do not mind :)
> >> > > > > > >
> >> > > > > > > * have correctness-in-face-of-failure tests
> >> > > > > > >
> >> > > > > > > Any examples of these tests in existing features? In works,
> we
> >> > > have a
> >> > > > > > clear
> >> > > > > > > understanding of what should be done by the time of 2.0
> >> release.
> >> > > > > > > That is very close goal for us, to verify IT monkey for
> >> existing
> >> > > > code.
> >> > > > > > >
> >> > > > > > > * don't rely on things outside of HBase for normal operation
> >> > (okay
> >> > > > for
> >> > > > > > > advanced operation)
> >> > > > > > >
> >> > > > > > > We do not.
> >> > > > > > >
> >> > > > > > > Enormous time has been spent already on the development and
> >> > testing
> >> > > > the
> >> > > > > > > feature, it has passed our internal tests and many rounds of
> >> code
> >> > > > > reviews
> >> > > > > > > by HBase committers. We do not mind if someone from HBase
> >> > community
> >> > > > > > > (outside of HW) will review the code, but it will probably
> >> takes
> >> > > > > forever
> >> > > > > > to
> >> > > > > > > wait for volunteer?, the feature is quite large (1MB+
> >> cumulative
> >> > > > patch)
> >> > > > > > >
> >> > > > > > > 2.0 branch is full of half baked features, most of them are
> in
> >> > > active
> >> > > > > > > development, therefore I am not following you here, Sean?
> Why
> >> > > > > HBASE-7912
> >> > > > > > is
> >> > > > > > > not good enough yet to be integrated into 2.0 branch?
> >> > > > > > >
> >> > > > > > > -Vlad
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > On Wed, Sep 7, 2016 at 8:23 AM, Sean Busbey <
> [email protected]
> >> >
> >> > > > wrote:
> >> > > > > > >
> >> > > > > > > > On Tue, Sep 6, 2016 at 10:36 PM, Josh Elser <
> >> > > [email protected]>
> >> > > > > > > wrote:
> >> > > > > > > > > So, the answer to Sean's original question is "as
> robust as
> >> > > > > snapshots
> >> > > > > > > > > presently are"? (independence of backup/restore failure
> >> > > tolerance
> >> > > > > > from
> >> > > > > > > > > snapshot failure tolerance)
> >> > > > > > > > >
> >> > > > > > > > > Is this just a question WRT context of the change, or
> is it
> >> > > means
> >> > > > > > for a
> >> > > > > > > > veto
> >> > > > > > > > > from you, Sean? Just trying to make sure I'm following
> >> along
> >> > > > > > > adequately.
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > >
> >> > > > > > > > I'd say ATM I'm -0, bordering on -1 but not for reasons I
> can
> >> > > > > > articulate
> >> > > > > > > > well.
> >> > > > > > > >
> >> > > > > > > > Here's an attempt.
> >> > > > > > > >
> >> > > > > > > > We've been trying to move, as a community, towards
> minimizing
> >> > > risk
> >> > > > to
> >> > > > > > > > downstream folks by getting "complete enough for use"
> gates
> >> in
> >> > > > place
> >> > > > > > > > before we introduce new features. This was spurred by a
> some
> >> > > > features
> >> > > > > > > > getting in half-baked and never making it to "can really
> use"
> >> > > > status
> >> > > > > > > > (I'm thinking of distributed log replay and the zk-less
> >> > > assignment
> >> > > > > > > > stuff, I don't recall if there was more).
> >> > > > > > > >
> >> > > > > > > > The gates, generally, included things like:
> >> > > > > > > >
> >> > > > > > > > * have docs
> >> > > > > > > > * have sunny-day correctness tests
> >> > > > > > > > * have correctness-in-face-of-failure tests
> >> > > > > > > > * don't rely on things outside of HBase for normal
> operation
> >> > > (okay
> >> > > > > for
> >> > > > > > > > advanced operation)
> >> > > > > > > >
> >> > > > > > > > As an example, we kept the MOB work off in a branch and
> out
> >> of
> >> > > > master
> >> > > > > > > > until it could pass these criteria. The big exemption
> we've
> >> had
> >> > > to
> >> > > > > > > > this was the hbase-spark integration, where we all agreed
> it
> >> > > could
> >> > > > > > > > land in master because it was very well isolated (the
> slide
> >> > away
> >> > > > from
> >> > > > > > > > including docs as a first-class part of building up that
> >> > > > integration
> >> > > > > > > > has led me to doubt the wisdom of this decision).
> >> > > > > > > >
> >> > > > > > > > We've also been treating inclusion in a "probably will be
> >> > > released
> >> > > > to
> >> > > > > > > > downstream" branches as a higher bar, requiring
> >> > > > > > > >
> >> > > > > > > > * don't moderately impact performance when the feature
> isn't
> >> in
> >> > > use
> >> > > > > > > > * don't severely impact performance when the feature is in
> >> use
> >> > > > > > > > * either default-to-on or show enough demand to believe a
> >> > > > non-trivial
> >> > > > > > > > number of folks will turn the feature on
> >> > > > > > > >
> >> > > > > > > > The above has kept MOB and hbase-spark integration out of
> >> > > branch-1,
> >> > > > > > > > presumably while they've "gotten more stable" in master
> from
> >> > the
> >> > > > odd
> >> > > > > > > > vendor inclusion.
> >> > > > > > > >
> >> > > > > > > > Are we going to have a 2.0 release before the end of the
> >> year?
> >> > > > We're
> >> > > > > > > > coming up on 1.5 years since the release of version 1.0;
> >> seems
> >> > > like
> >> > > > > > > > it's about time, though I haven't seen any concrete plans
> >> this
> >> > > > year.
> >> > > > > > > > Presuming we are going to have one by the end of the
> year, it
> >> > > > seems a
> >> > > > > > > > bit close to still be adding in "features that need
> maturing"
> >> > on
> >> > > > the
> >> > > > > > > > branch.
> >> > > > > > > >
> >> > > > > > > > The lack of a concrete plan for 2.0 keeps me from
> considering
> >> > > these
> >> > > > > > > > things blocker at the moment. But I know first hand how
> much
> >> > > > trouble
> >> > > > > > > > folks have had with other features that have gone into
> >> > downstream
> >> > > > > > > > facing releases without robustness checks (i.e.
> replication),
> >> > and
> >> > > > I'm
> >> > > > > > > > concerned about what we're setting up if 2.0 goes out with
> >> this
> >> > > > > > > > feature in its current state.
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > >
> >> > > > >    - Andy
> >> > > > >
> >> > > > > Problems worthy of attack prove their worth by hitting back. -
> Piet
> >> > > Hein
> >> > > > > (via Tom White)
> >> > > > >
> >> > > >
> >> > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > >
> >> > >    - Andy
> >> > >
> >> > > Problems worthy of attack prove their worth by hitting back. - Piet
> >> Hein
> >> > > (via Tom White)
> >> > >
> >> >
> >>
>
