Re: [DISCUSS] Direction of HBCK2

Stack Wed, 29 May 2019 15:10:40 -0700

Would be good to do a bit of evangelizing that hbck2 is intentionally not
meant to be like hbck1. hbck1 gave off the impression that it could fix
"all" problems, rebuilding master functionality on the exterior in a
contending script. Re-reading the hbck2 home page [1], hoping to find a
quote to back Josh's perception, it is plain the text needs to state more
forcefully the difference in philosophy.


On missing hbck2 functionality, there is an outstanding task (HBASE-21745)
sorting what is needed from hbck1 hangovers so the likes of our Andrew have
confidence that should he hit an operational issue, he'll have tooling for
repair. Let's be judicious in what we add to hbck2. We've left behind many
of the problems hbck1 used to 'fix'. A rebuild of meta should disaster hit
makes sense (and is a long-time ask). Fixup for the mess JMS is able to
make upgrading from hbase1 to hbase2 makes sense too since this is what our
users will be doing (File JIRAs w/ detail on the mess JMS?). Andrew made a
list a while back here that needs consideration (HBASE-21745).

S

1. https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
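
For readers new to the terms used in the quoted thread below: a "hole" is a
gap in a table's chain of region key ranges, and an "overlap" is a stretch of
keyspace claimed by more than one region. The following is a minimal sketch of
that check, not hbck1 or hbck2 code. The function name and the input format (a
list of (start_key, end_key) byte pairs, with b"" meaning "unbounded") are
assumptions made for illustration; how the ranges are gathered (hbase:meta,
on-disk region info) is left out.

```python
from typing import List, Optional, Tuple

# (start_key, end_key); b"" as start means "from the beginning of the
# keyspace", b"" as end means "to the end of the keyspace".
Region = Tuple[bytes, bytes]


def check_region_chain(regions: List[Region]) -> Tuple[List[Region], List[Region]]:
    """Return (holes, overlaps) for one table's region key ranges.

    holes    -- key ranges not covered by any region
    overlaps -- regions that start inside an already-covered range
    """
    holes: List[Region] = []
    overlaps: List[Region] = []

    # The chain is covered up to this key; None means "covered to the end".
    covered_to: Optional[bytes] = b""

    # Walk the regions in start-key order (b"" sorts first).
    for start, end in sorted(regions, key=lambda r: r[0]):
        if covered_to is None or start < covered_to:
            overlaps.append((start, end))
        elif start > covered_to:
            holes.append((covered_to, start))

        # Advance the covered boundary.
        if covered_to is not None:
            if end == b"":
                covered_to = None
            elif end > covered_to:
                covered_to = end

    # Anything left uncovered at the tail of the keyspace is a hole too.
    if covered_to is not None:
        holes.append((covered_to, b""))

    return holes, overlaps
```

Given the ranges [(b"", b"b"), (b"b", b"d"), (b"f", b"k"), (b"h", b"")], this
reports the hole (b"d", b"f") and flags (b"h", b"") as overlapping. Note that
if the input lists only online regions, a region that is merely offline shows
up as a hole, which is the misleading-report problem raised in the thread.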



On Wed, May 29, 2019 at 8:55 AM Andrew Purtell <apurt...@apache.org> wrote:

> To me this is a succinct specification of minimum functionality for a
> recovery tool: using on disk bits, rebuild meta table, with end result a
> working cluster that did not miss any data during the reconstruction.
>
> Of course focusing on root causes of metadata mismanagement is appropriate
> when investigating a specific incident, but this is orthogonal to the
> question of whether or not recovery is possible after a bug corrupts
> metadata. It is customary for filesystems and databases to ship with a tool
> that attempts recovery after corruption, on the (correct, IMHO) assumption
> that corruption is inevitable, either due to logic bug, hardware problems,
> or operator error.
>
> The features of hbck in HBase 1 that have resolved availability problems
> where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
> In HBaseFsck.java in branch-2 these are all in the unsupported options set.
> Because these are all lacking in HBase 2 I will not certify it ready for
> production to my employer. If there is some other tool which offers these
> recovery options I'm not aware of it nor documentation for it and would
> appreciate a pointer if you have one.
>
>
> On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki <brfrn...@apache.org>
> wrote:
>
> > Thanks Wellington.
> >
> > > > I guess those can still be fixed with some combinations of commands
> > > > today, such as merge/assign.
> > >
> > > Let me explain the situation I faced in the customer's cluster a little
> > > more. It seemed that the table data in HDFS was intact, but they had
> > > lost some of the table's metadata (in hbase:meta), so I needed to
> > > rebuild the meta from the HDFS data. Can we still fix this case with
> > > some combination of commands today? If so, I would appreciate it if you
> > > could suggest the steps to me.
> > >
> > > > And focus on fixing the main root cause of such problems, as a means
> > > > of reducing the need for such commands.
> > >
> > > Yes, correct. Actually, I usually do that, but I didn't in that case.
> >
> >
> > On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> > wellington.chevre...@gmail.com> wrote:
> >
> > > > Thanks Toshihiro! I guess those can still be fixed with some
> > > > combinations of commands today, such as merge/assign. Of course, it
> > > > requires some extra scripting and log reading in cases where many
> > > > regions are in an inconsistent state; maybe we should work on
> > > > providing a one-liner command that relies on the existing ones. And
> > > > focus on fixing the main root cause of such problems, as a means of
> > > > reducing the need for such commands.
> > > >
> > > > I'm not really a fan of offlinemetarepair, nor of hbck1 fix
> > > > holes/overlaps, and would rather not have those back. Sure, those are
> > > > easy and convenient to trigger, but hbck1 reports are sometimes
> > > > misleading (for instance, it reports holes when region(s) on the
> > > > chain is/are simply not online), and that, combined with the
> > > > availability of such heavy hammers, has led inexperienced operators
> > > > to run them and end up in a worse state.
> > >
> > > > On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org>
> > > > wrote:
> > >
> > > > Hi Wellington,
> > > >
> > > > > I actually saw table holes in a customer's cluster, and I fixed the
> > > > > issues with the workaround I mentioned in HBASE-21665
> > > > > <https://issues.apache.org/jira/browse/HBASE-21665>. I didn't dig
> > > > > into why the table holes happened at that time because the customer
> > > > > didn't want me to.
> > > > >
> > > > > However, IMO, whatever the reason, I think we should have a direct
> > > > > way to fix holes and overlaps.
> > > >
> > > > On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
> > > > wellington.chevre...@gmail.com> wrote:
> > > >
> > > > > > So JMS, Toshihiro, it seems like upgrading from some 1.x to 2.x
> > > > > > consistently triggers this problem? Do you guys know if there are
> > > > > > any bug jiras open that would cover these scenarios? If not, and
> > > > > > if you guys have enough resources for investigating it, maybe
> > > > > > it's worth opening a specific jira?
> > > > >
> > > > > > On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <
> > > > > > jean-m...@spaggiari.org> wrote:
> > > > >
> > > > > > > Personally, when I tried to upgrade from 1.4.x to 2.2.x I ended
> > > > > > > up in a situation where my meta was empty and had to be
> > > > > > > repaired, but I lacked OfflineMetaRepair for 2.2.x, so I just
> > > > > > > had to delete all my tables, get a brand new installation,
> > > > > > > recreate the tables and bulkload the data back into them. I
> > > > > > > would have been happy to have an OfflineMetaRepair.
> > > > > > >
> > > > > > > But it's more like an experimental cluster than a production
> > > > > > > one...
> > > > > >
> > > > > > JMS
> > > > > >
> > > > > > > On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <
> > > > > > > wellington.chevre...@gmail.com> wrote:
> > > > > >
> > > > > > > > Interesting, I haven't seen any cases where OfflineMetaRepair
> > > > > > > > was really required among our customer base (running
> > > > > > > > cdh6.1.x/hbase2.1.1, cdh6.2/hbase2.1.2). The majority of the
> > > > > > > > RIT issues I came across on hbase 2.x were related to AP/SCP
> > > > > > > > failures, most of which could be sorted out with the hbck2
> > > > > > > > commands available by then (in some cases, this required
> > > > > > > > some CLI scripting to build up a "bulk" assign command).
> > > > > > >
> > > > > > > > On Wed, May 29, 2019 at 00:55, Toshihiro Suzuki <
> > > > > > > > brfrn...@apache.org> wrote:
> > > > > > >
> > > > > > > > Hi Josh,
> > > > > > > >
> > > > > > > > > Thank you for the explanation. I agree with the direction
> > > > > > > > > for HBCK2.
> > > > > > > > >
> > > > > > > > > The problem I wanted to raise in the Jira is that until we
> > > > > > > > > implement the features you mentioned, we don't have any
> > > > > > > > > direct way to fix holes and overlaps. Holes and overlaps
> > > > > > > > > can be created by bugs or operational errors, so I think
> > > > > > > > > we should be able to fix these issues.
> > > > > > > > >
> > > > > > > > > I thought OfflineMetaRepair could be a workaround for these
> > > > > > > > > issues until we implement those features in HBCK2.
> > > > > > > >
> > > > > > > > Regards,
> > > > > > > > Toshi
> > > > > > > >
> > > > > > > >
> > > > > > > > > On Tue, May 28, 2019 at 9:12 AM Josh Elser
> > > > > > > > > <els...@apache.org> wrote:
> > > > > > > >
> > > > > > > > > Context: https://issues.apache.org/jira/browse/HBASE-21665
> > > > > > > > >
> > > > > > > > > > I left a comment on the above issue about what I thought
> > > > > > > > > > good things to build into HBCK2 would be -- a focus on
> > > > > > > > > > specific "primitive" operations that an admin/operator
> > > > > > > > > > could use to help repair an otherwise broken HBase
> > > > > > > > > > installation. Some examples I had in my head were:
> > > > > > > > >
> > > > > > > > > * Create an empty region (to plug a hole)
> > > > > > > > > * Report holes in a region chain
> > > > > > > > >
> > > > > > > > > > In my head, the difference for HBCK2 was that we want to
> > > > > > > > > > give folks the tools to fix their cluster, but we did not
> > > > > > > > > > want to own the "just fix everything" kind of tool that
> > > > > > > > > > HBCK1 had become. The problem with HBCK1 was that it was
> > > > > > > > > > often difficult/problematic for us to know how to
> > > > > > > > > > correctly fix a problem (the same problem could be
> > > > > > > > > > corrected in different ways).
> > > > > > > > >
> > > > > > > > > > Andrew had some confusion about this, so I'm not sure if
> > > > > > > > > > I'm off-base or if we're all in agreement on direction
> > > > > > > > > > and we just need to do a better job documenting things.
> > > > > > > > > > Thanks for keeping me honest either way :)
> > > > > > > > >
> > > > > > > > > > And just in case it doesn't go without saying, HBCK2
> > > > > > > > > > would be something that helps fix a system, while we want
> > > > > > > > > > to always understand the root cause of how/why we got
> > > > > > > > > > into a situation where we needed HBCK2 and also address
> > > > > > > > > > that.
> > > > > > > > >
> > > > > > > > > - Josh
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
>
> --
> Best regards,
> Andrew
>
> Words like orphans lost among the crosstalk, meaning torn from truth's
> decrepit hands
>    - A23, Crosstalk
>
