To me this is a succinct specification of the minimum functionality for a
recovery tool: using the on-disk bits, rebuild the meta table, with the end
result a working cluster that did not lose any data during reconstruction.

Of course focusing on root causes of metadata mismanagement is appropriate
when investigating a specific incident, but this is orthogonal to the
question of whether or not recovery is possible after a bug corrupts
metadata. It is customary for filesystems and databases to ship with a tool
that attempts recovery after corruption, on the (correct, IMHO) assumption
that corruption is inevitable, whether due to logic bugs, hardware problems,
or operator error.

The features of hbck in HBase 1 that have resolved availability problems
where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
In HBaseFsck.java in branch-2 these are all in the unsupported options set.
Because these are all lacking in HBase 2 I will not certify it to my
employer as ready for production. If there is some other tool which offers
these recovery options, I am not aware of it or of any documentation for
it, and would appreciate a pointer if you have one.
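
To make the holes and overlaps checks concrete, here is a minimal sketch in
Python of how a table's region chain could be scanned for gaps and overlapping
key ranges. It is an illustration only and not hbck's actual logic: the
function name find_holes_and_overlaps and the (start_key, end_key) tuple
representation are hypothetical, and an empty key is assumed to mean the open
start or end of the table.

```python
from typing import List, Tuple

# Hypothetical representation: (start_key, end_key), where b"" marks the
# open start or open end of the table's key space.
Region = Tuple[bytes, bytes]


def find_holes_and_overlaps(
    regions: List[Region],
) -> Tuple[List[Region], List[Region]]:
    """Return (holes, overlaps) as lists of key ranges for one table."""
    holes: List[Region] = []
    overlaps: List[Region] = []

    # Order regions by start key; b"" sorts before every other key.
    ordered = sorted(regions, key=lambda r: r[0])
    if not ordered:
        # No regions at all: the whole key space is one hole.
        return [(b"", b"")], []

    if ordered[0][0] != b"":
        # Nothing covers the beginning of the table.
        holes.append((b"", ordered[0][0]))

    covered_end = ordered[0][1]  # furthest end key covered so far
    for start, end in ordered[1:]:
        if covered_end == b"":
            # Earlier regions already reach the end of the table, so this
            # region overlaps them entirely.
            overlaps.append((start, end))
            continue
        if start > covered_end:
            holes.append((covered_end, start))
        elif start < covered_end:
            overlap_end = covered_end if end == b"" else min(end, covered_end)
            overlaps.append((start, overlap_end))
        if end == b"" or end > covered_end:
            covered_end = end

    if covered_end != b"":
        # Nothing covers the end of the table.
        holes.append((covered_end, b""))
    return holes, overlaps
```

A report like this only says where the chain is broken. Deciding how to
repair it (plug the hole with an empty region, merge the overlapping regions,
and so on) is the part a recovery tool or operator still has to own.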


On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki <brfrn...@apache.org>
wrote:

> Thanks Wellington.
>
> > I guess those can still be fixed with some combinations of commands
> > today, such as merge/assign.
>
> Let me explain the situation I faced in the customer's cluster a little bit
> more. It seemed like the table data in HDFS was intact but they had lost
> some of the table's metadata (in hbase:meta), so I needed to rebuild the
> meta from the HDFS data. In this case, can we still fix it with some
> combination of commands today? If so, I would appreciate it if you could
> suggest the steps to me.
>
> > And focus on fixing the main root cause of such problems, as a means to
> > reduce the need to use such commands.
>
> Yes, correct. Actually I usually do that. But I didn't do that in that
> case.
>
>
> On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> wellington.chevre...@gmail.com> wrote:
>
> > Thanks Toshihiro! I guess those can still be fixed with some combinations
> > of commands today, such as merge/assign. Of course, it requires some extra
> > scripting and log reading in cases where many regions are in an
> > inconsistent state; maybe we should work on providing a one-liner command
> > that relies on the existing ones. And focus on fixing the main root
> > cause of such problems, as a means to reduce the need to use such commands.
> >
> > I'm not really a fan of offlinemetarepair, nor of hbck1 fix holes/overlaps,
> > and would rather not have those back. Sure, those are easy and convenient to
> > trigger, but hbck1 reports are sometimes misleading (for instance, it
> > reports holes when region(s) in the chain is/are simply not online), and
> > that, combined with the availability of such heavy hammers, has led
> > inexperienced operators to run them and end up in a worse state.
> >
> > On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org>
> > wrote:
> >
> > > Hi Wellington,
> > >
> > > I actually saw table holes in a customer's cluster, and I just fixed the
> > > issues with the workaround I mentioned in HBASE-21665
> > > <https://issues.apache.org/jira/browse/HBASE-21665>. I didn't dig into
> > > why the table holes happened at that time because the customer didn't
> > > want me to.
> > >
> > > However, IMO, whatever the reason, I think we should have a direct way to
> > > fix holes and overlaps.
> > >
> > > On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
> > > wellington.chevre...@gmail.com> wrote:
> > >
> > > > So JMS, Toshihiro, it seems like upgrading from some 1.x to 2.x
> > > > consistently triggers this problem? Do you guys know if there are any
> > > > bug jiras open that would cover these scenarios? If not, and if you
> > > > guys have enough resources for investigating it, maybe it is worth
> > > > opening a specific jira?
> > > >
> > > > On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <
> > > > jean-m...@spaggiari.org> wrote:
> > > >
> > > > > Personally, when I tried to upgrade from 1.4.x to 2.2.x I ended up in a
> > > > > situation where my meta was empty and had to be repaired, but I
> > > > > lacked OfflineMetaRepair for 2.2.x, so I just had to delete all my
> > > > > tables, get a brand new installation, recreate the tables and
> > > > > bulkload the data back into them. I would have been happy to have an
> > > > > OfflineMetaRepair.
> > > > >
> > > > > But it's more like an experimental cluster than a production one...
> > > > >
> > > > > JMS
> > > > >
> > > > > On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <
> > > > > wellington.chevre...@gmail.com> wrote:
> > > > >
> > > > > > Interesting, I haven't seen any cases where OfflineMetaRepair was
> > > > > > really required among our customer base (running
> > > > > > cdh6.1.x/hbase2.1.1, cdh6.2/hbase2.1.2). The majority of RIT issues
> > > > > > I have come across on hbase 2.x were related to AP/SCP failures,
> > > > > > most of which could be sorted out with the hbck2 commands available
> > > > > > by then (in some cases, this required some CLI scripting to build
> > > > > > up a "bulk" assign command).
> > > > > >
> > > > > > On Wed, May 29, 2019 at 00:55, Toshihiro Suzuki <brfrn...@apache.org>
> > > > > > wrote:
> > > > > >
> > > > > > > Hi Josh,
> > > > > > >
> > > > > > > Thank you for the explanation. I agree with the direction for
> > > > > > > HBCK2.
> > > > > > >
> > > > > > > The problem I wanted to raise in the Jira is that until we
> > > > > > > implement the features you mentioned, we don't have any direct
> > > > > > > way to fix holes and overlaps. Holes and overlaps can be created
> > > > > > > by bugs or operational errors, so I think we should be able to
> > > > > > > fix these issues.
> > > > > > >
> > > > > > > I thought OfflineMetaRepair could be a workaround for these
> > > > > > > issues until we implement the features of HBCK2.
> > > > > > >
> > > > > > > Regards,
> > > > > > > Toshi
> > > > > > >
> > > > > > >
> > > > > > > On Tue, May 28, 2019 at 9:12 AM Josh Elser <els...@apache.org>
> > > > > > > wrote:
> > > > > > >
> > > > > > > > Context: https://issues.apache.org/jira/browse/HBASE-21665
> > > > > > > >
> > > > > > > > I left a comment on the above issue about what I thought good
> > > > > > > > things to build into HBCK2 would be -- a focus on specific
> > > > > > > > "primitive" operations that an admin/operator could use to help
> > > > > > > > repair an otherwise broken HBase installation. Some examples I
> > > > > > > > had in my head were:
> > > > > > > >
> > > > > > > > * Create an empty region (to plug a hole)
> > > > > > > > * Report holes in a region chain
> > > > > > > >
> > > > > > > > In my head, the difference for HBCK2 was that we want to give
> > > > > > > > folks the tools to fix their cluster, but we did not want to
> > > > > > > > own the "just fix everything" kind of tool that HBCK1 had
> > > > > > > > become. The problem with HBCK1 was that it was often
> > > > > > > > difficult/problematic for us to know how to correctly fix a
> > > > > > > > problem (the same problem could be corrected in different
> > > > > > > > ways).
> > > > > > > >
> > > > > > > > Andrew had some confusion about this, so I'm not sure if I'm
> > > > > > > > off-base or if we're all in agreement on direction and we just
> > > > > > > > need to do a better job documenting things. Thanks for keeping
> > > > > > > > me honest either way :)
> > > > > > > >
> > > > > > > > And just in case it doesn't go without saying, HBCK2 would be
> > > > > > > > something that helps fix a system, while we want to always
> > > > > > > > understand the root cause of how/why we got into a situation
> > > > > > > > where we needed HBCK2 and also address that.
> > > > > > > >
> > > > > > > > - Josh
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>


-- 
Best regards,
Andrew

Words like orphans lost among the crosstalk, meaning torn from truth's
decrepit hands
   - A23, Crosstalk
