Re: [DISCUSS] Direction of HBCK2

2019-06-03 Thread Wellington Chevreuil
My take from this discussion is that some problems are worth adding "simple
to use" hbck2 commands for (such as the meta missing regions one), while some
simpler/less critical problems could have documented recipes that use the
currently available commands. I suppose HBASE-21745 should be used to
triage recurring/potential problems and discuss the ideal solution (whether
each demands a new hbck2 command or can be a documented recipe).

On Thu, May 30, 2019 at 22:17, Josh Elser wrote:

> Great! Thanks for clarifying.
>
> Script-able (and recipes for the common-ish problems -- both those we
> know and those we don't) are definitely goals in my head.
>
> On 5/30/19 5:06 PM, Andrew Purtell wrote:
> > Composable tools are fine if simple and scriptable.
> >
> > If you read the thread I think my complaint justifiable. It is not that
> they are lacking. It is that they are lacking and the response to the
> concern is breezy “oh just do this <[...] knowledge>”. Just so we are clear
> what I am criticizing. Someone needs to
> call out in no uncertain terms how operator unfriendly this position is
> whether intentional or not.
> >
> > Thanks for the consideration.
> >
> >> On May 30, 2019, at 2:00 PM, Josh Elser  wrote:
> >>
> >> It sounds to me like you're saying: "No, I don't think compose-able
> tools are a sufficient substitute in HBCK2 for what HBCK1 did".
> >>
> >> I'm going to just delete everything else I want to write because it's
> going to turn into a massive argument and de-rail this further. For a
> second time, please stop the complaints about things that don't exist on
> this thread. We all know this already.
> >>
> >>> On 5/30/19 12:58 PM, Andrew Purtell wrote:
> >>> I did a both barrels type response to a suggestion Wellington made
> that I hope communicates the right level of dismay at the prevailing line
> of thought in this thread.
> >>> Let me say I agree hbck 1 was sometimes oversold as a magic tool.
> >>> However if you analyze all of its options and then look to branch 2,
> where are the gaps. In branch 1 there is a command line tool that can be
> executed by operations and first level support. Its options can be
> described in a runbook with cut and paste examples. In branch 2 ... ?
> >>> There appears no ready solution for detecting and deploying undeployed
> “missing” regions.
> >>> There appears no ready solution for fixing a failed split or merge or
> other corruption producing a hole or overlap in the region chain.
> >>> There appears no tool capable of rebuilding meta from scratch from
> HDFS level metadata; a last but crucial resort as this is what holds the
> line against a complete and time intensive restore from backup.
> >>> I may have an incorrect impression of some of this. If so that would
> be a big relief. If not these are suggested areas of focus.
> >>> I’m not saying that 2 needs Hbck exactly as it is in 1. However the
> lack of simple recovery tools or actions that can be taken by a non expert
> guided by a runbook means the risk to operations when there is the
> inevitable problem is higher. And I don’t mean theoretical problems. I mean
> the commonly occurring issues Hbck 1 was coded up to address in a mostly
> automated way, like failed splits or failed deployments or simple HDFS
> level corruptions like loss of meta region hfiles. Lacking simple tooling
> our operations will have to do  more complex, labor intensive,
> and or risky. This factors in to the major version upgrade risk analysis.
> >>> What I would advise is an analysis that enumerates all of the risks
> and specific conditions Hbck 1 addresses, then excludes those not relevant
> for the 2 code base, then excludes those which have easy and simple tools
> existing right now to solve. What you have left is a list of action items.
> Then there should be an analysis of the new risks in 2 given AMv2s theory
> of operation, for example for each procedure based action if the procedure
> is always failing how can the operator recover the prerequisites for
> successful completion, and provide a simple tool or option for applying a
> fix or remediation to cluster state.
> >>>> On May 30, 2019, at 7:16 AM, Josh Elser wrote:
> >>>>
> >>>> Right, this discuss isn't meant to be implying that any of this
> exists -- instead, I wanted to make sure we're focused on building tooling
> which both devs and users will find usable and effective.
> >>>>
> >>>> What's your gut-reaction to what I suggested? I think you're saying
> you see operators having to apply more understanding/insight to fix a
> "complex problem" as taking on more risk which you'd have to weigh. In
> other words, anything less than the verbatim "fix these problems" flags you
> mentioned earlier would require you to do the risk-analysis math if moving
> to HBase2?
> >>>>
> >>>> Thanks for your insights.
> >>>>
> > On 5/29/19 4:45 PM, Andrew Purtell wrote:
> > I have yet to see essential HBCK functions in 1 replaced by anything
> -
> > documentation, 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Josh Elser

Great! Thanks for clarifying.

Script-able (and recipes for the common-ish problems -- both those we 
know and those we don't) are definitely goals in my head.


On 5/30/19 5:06 PM, Andrew Purtell wrote:

Composable tools are fine if simple and scriptable.

If you read the thread I think my complaint justifiable. It is not that they are 
lacking. It is that they are lacking and the response to the concern is breezy “oh 
just do this ”. Just so we are 
clear what I am criticizing. Someone needs to call out in no uncertain terms how 
operator unfriendly this position is whether intentional or not.

Thanks for the consideration.


On May 30, 2019, at 2:00 PM, Josh Elser  wrote:

It sounds to me like you're saying: "No, I don't think compose-able tools are a 
sufficient substitute in HBCK2 for what HBCK1 did".

I'm going to just delete everything else I want to write because it's going to 
turn into a massive argument and de-rail this further. For a second time, 
please stop the complaints about things that don't exist on this thread. We all 
know this already.


On 5/30/19 12:58 PM, Andrew Purtell wrote:
I did a both barrels type response to a suggestion Wellington made that I hope 
communicates the right level of dismay at the prevailing line of thought in 
this thread.
Let me say I agree hbck 1 was sometimes oversold as a magic tool.
However if you analyze all of its options and then look to branch 2, where are 
the gaps. In branch 1 there is a command line tool that can be executed by 
operations and first level support. Its options can be described in a runbook 
with cut and paste examples. In branch 2 ... ?
There appears no ready solution for detecting and deploying undeployed 
“missing” regions.
There appears no ready solution for fixing a failed split or merge or other 
corruption producing a hole or overlap in the region chain.
There appears no tool capable of rebuilding meta from scratch from HDFS level 
metadata; a last but crucial resort as this is what holds the line against a 
complete and time intensive restore from backup.
I may have an incorrect impression of some of this. If so that would be a big 
relief. If not these are suggested areas of focus.
I’m not saying that 2 needs Hbck exactly as it is in 1. However the lack of simple 
recovery tools or actions that can be taken by a non expert guided by a runbook means 
the risk to operations when there is the inevitable problem is higher. And I don’t 
mean theoretical problems. I mean the commonly occurring issues Hbck 1 was coded up 
to address in a mostly automated way, like failed splits or failed deployments or 
simple HDFS level corruptions like loss of meta region hfiles. Lacking simple tooling 
our operations will have to do  more complex, labor intensive, and 
or risky. This factors in to the major version upgrade risk analysis.
What I would advise is an analysis that enumerates all of the risks and 
specific conditions Hbck 1 addresses, then excludes those not relevant for the 
2 code base, then excludes those which have easy and simple tools existing 
right now to solve. What you have left is a list of action items. Then there 
should be an analysis of the new risks in 2 given AMv2s theory of operation, 
for example for each procedure based action if the procedure is always failing 
how can the operator recover the prerequisites for successful completion, and 
provide a simple tool or option for applying a fix or remediation to cluster 
state.

On May 30, 2019, at 7:16 AM, Josh Elser  wrote:

Right, this discuss isn't meant to be implying that any of this exists -- 
instead, I wanted to make sure we're focused on building tooling which both 
devs and users will find usable and effective.

What's your gut-reaction to what I suggested? I think you're saying you see operators having to 
apply more understanding/insight to fix a "complex problem" as taking on more risk which 
you'd have to weigh. In other words, anything less than the verbatim "fix these problems" 
flags you mentioned earlier would require you to do the risk-analysis math if moving to HBase2?

Thanks for your insights.


On 5/29/19 4:45 PM, Andrew Purtell wrote:
I have yet to see essential HBCK functions in 1 replaced by anything -
documentation, script, hbck2, whatever.
Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
state? This would be faster than a complete restore from backup. It would
be useful and important to offer this option to operators, but not
essential, because it could be valid to say if meta is screwed so are you
and you have to restore completely from backup. Meta is small, a fraction
of total data footprint. Seems a real shame to impose such a high cost when
there could be an alternative. I'd have to think for a while about
accepting this kind of operational risk when HBase 1 has such tooling.
What I am more worried about is this: Do we have a tool or script in HBase
2 that can fix errors in the region chain caused by failed 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Andrew Purtell
Composable tools are fine if simple and scriptable. 

If you read the thread I think my complaint justifiable. It is not that they 
are lacking. It is that they are lacking and the response to the concern is 
breezy “oh just do this ”. Just so 
we are clear what I am criticizing. Someone needs to call out in no uncertain 
terms how operator unfriendly this position is whether intentional or not. 

Thanks for the consideration. 

> On May 30, 2019, at 2:00 PM, Josh Elser  wrote:
> 
> It sounds to me like you're saying: "No, I don't think compose-able tools are 
> a sufficient substitute in HBCK2 for what HBCK1 did".
> 
> I'm going to just delete everything else I want to write because it's going 
> to turn into a massive argument and de-rail this further. For a second time, 
> please stop the complaints about things that don't exist on this thread. We 
> all know this already.
> 
>> On 5/30/19 12:58 PM, Andrew Purtell wrote:
>> I did a both barrels type response to a suggestion Wellington made that I 
>> hope communicates the right level of dismay at the prevailing line of 
>> thought in this thread.
>> Let me say I agree hbck 1 was sometimes oversold as a magic tool.
>> However if you analyze all of its options and then look to branch 2, where 
>> are the gaps. In branch 1 there is a command line tool that can be executed 
>> by operations and first level support. Its options can be described in a 
>> runbook with cut and paste examples. In branch 2 ... ?
>> There appears no ready solution for detecting and deploying undeployed 
>> “missing” regions.
>> There appears no ready solution for fixing a failed split or merge or other 
>> corruption producing a hole or overlap in the region chain.
>> There appears no tool capable of rebuilding meta from scratch from HDFS 
>> level metadata; a last but crucial resort as this is what holds the line 
>> against a complete and time intensive restore from backup.
>> I may have an incorrect impression of some of this. If so that would be a 
>> big relief. If not these are suggested areas of focus.
>> I’m not saying that 2 needs Hbck exactly as it is in 1. However the lack of 
>> simple recovery tools or actions that can be taken by a non expert guided by 
>> a runbook means the risk to operations when there is the inevitable problem 
>> is higher. And I don’t mean theoretical problems. I mean the commonly 
>> occurring issues Hbck 1 was coded up to address in a mostly automated way, 
>> like failed splits or failed deployments or simple HDFS level corruptions 
>> like loss of meta region hfiles. Lacking simple tooling our operations will 
>> have to do  more complex, labor intensive, and or risky. This 
>> factors in to the major version upgrade risk analysis.
>> What I would advise is an analysis that enumerates all of the risks and 
>> specific conditions Hbck 1 addresses, then excludes those not relevant for 
>> the 2 code base, then excludes those which have easy and simple tools 
>> existing right now to solve. What you have left is a list of action items. 
>> Then there should be an analysis of the new risks in 2 given AMv2s theory of 
>> operation, for example for each procedure based action if the procedure is 
>> always failing how can the operator recover the prerequisites for successful 
>> completion, and provide a simple tool or option for applying a fix or 
>> remediation to cluster state.
>>> On May 30, 2019, at 7:16 AM, Josh Elser  wrote:
>>> 
>>> Right, this discuss isn't meant to be implying that any of this exists -- 
>>> instead, I wanted to make sure we're focused on building tooling which both 
>>> devs and users will find usable and effective.
>>> 
>>> What's your gut-reaction to what I suggested? I think you're saying you see 
>>> operators having to apply more understanding/insight to fix a "complex 
>>> problem" as taking on more risk which you'd have to weigh. In other words, 
>>> anything less than the verbatim "fix these problems" flags you mentioned 
>>> earlier would require you to do the risk-analysis math if moving to HBase2?
>>> 
>>> Thanks for your insights.
>>> 
>>>> On 5/29/19 4:45 PM, Andrew Purtell wrote:
>>>> I have yet to see essential HBCK functions in 1 replaced by anything -
>>>> documentation, script, hbck2, whatever.
>>>> Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
>>>> state? This would be faster than a complete restore from backup. It would
>>>> be useful and important to offer this option to operators, but not
>>>> essential, because it could be valid to say if meta is screwed so are you
>>>> and you have to restore completely from backup. Meta is small, a fraction
>>>> of total data footprint. Seems a real shame to impose such a high cost when
>>>> there could be an alternative. I'd have to think for a while about
>>>> accepting this kind of operational risk when HBase 1 has such tooling.
>>>> What I am more worried about is this: Do we have a tool or script in HBase
>>>> 2 that can fix 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Josh Elser
It sounds to me like you're saying: "No, I don't think compose-able 
tools are a sufficient substitute in HBCK2 for what HBCK1 did".


I'm going to just delete everything else I want to write because it's 
going to turn into a massive argument and de-rail this further. For a 
second time, please stop the complaints about things that don't exist on 
this thread. We all know this already.


On 5/30/19 12:58 PM, Andrew Purtell wrote:

I did a both barrels type response to a suggestion Wellington made that I hope 
communicates the right level of dismay at the prevailing line of thought in 
this thread.

Let me say I agree hbck 1 was sometimes oversold as a magic tool.

However if you analyze all of its options and then look to branch 2, where are 
the gaps. In branch 1 there is a command line tool that can be executed by 
operations and first level support. Its options can be described in a runbook 
with cut and paste examples. In branch 2 ... ?

There appears no ready solution for detecting and deploying undeployed 
“missing” regions.

There appears no ready solution for fixing a failed split or merge or other 
corruption producing a hole or overlap in the region chain.

There appears no tool capable of rebuilding meta from scratch from HDFS level 
metadata; a last but crucial resort as this is what holds the line against a 
complete and time intensive restore from backup.

I may have an incorrect impression of some of this. If so that would be a big 
relief. If not these are suggested areas of focus.

I’m not saying that 2 needs Hbck exactly as it is in 1. However the lack of simple 
recovery tools or actions that can be taken by a non expert guided by a runbook means 
the risk to operations when there is the inevitable problem is higher. And I don’t 
mean theoretical problems. I mean the commonly occurring issues Hbck 1 was coded up 
to address in a mostly automated way, like failed splits or failed deployments or 
simple HDFS level corruptions like loss of meta region hfiles. Lacking simple tooling 
our operations will have to do  more complex, labor intensive, and 
or risky. This factors in to the major version upgrade risk analysis.

What I would advise is an analysis that enumerates all of the risks and 
specific conditions Hbck 1 addresses, then excludes those not relevant for the 
2 code base, then excludes those which have easy and simple tools existing 
right now to solve. What you have left is a list of action items. Then there 
should be an analysis of the new risks in 2 given AMv2s theory of operation, 
for example for each procedure based action if the procedure is always failing 
how can the operator recover the prerequisites for successful completion, and 
provide a simple tool or option for applying a fix or remediation to cluster 
state.



On May 30, 2019, at 7:16 AM, Josh Elser  wrote:

Right, this discuss isn't meant to be implying that any of this exists -- 
instead, I wanted to make sure we're focused on building tooling which both 
devs and users will find usable and effective.

What's your gut-reaction to what I suggested? I think you're saying you see operators having to 
apply more understanding/insight to fix a "complex problem" as taking on more risk which 
you'd have to weigh. In other words, anything less than the verbatim "fix these problems" 
flags you mentioned earlier would require you to do the risk-analysis math if moving to HBase2?

Thanks for your insights.


On 5/29/19 4:45 PM, Andrew Purtell wrote:
I have yet to see essential HBCK functions in 1 replaced by anything -
documentation, script, hbck2, whatever.
Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
state? This would be faster than a complete restore from backup. It would
be useful and important to offer this option to operators, but not
essential, because it could be valid to say if meta is screwed so are you
and you have to restore completely from backup. Meta is small, a fraction
of total data footprint. Seems a real shame to impose such a high cost when
there could be an alternative. I'd have to think for a while about
accepting this kind of operational risk when HBase 1 has such tooling.
What I am more worried about is this: Do we have a tool or script in HBase
2 that can fix errors in the region chain caused by failed splits, failed
merges, or double assignment? It seems not, and the implications for
service availability are not good when compared with HBase 1. With HBase 1,
hbck is an option. Sure, it has a lot of problematic aspects, but I have
seen it recover a cluster's total availability with fairly fast execution.
It could be valid, not saying I agree with this point, to clearly document
that all aspects of recovery from corrupted metadata is the responsibility
of the operator, at least this is full disclosure. We can then weigh the
cost and risk associated with this policy when deciding if ever to upgrade.

On Wed, May 29, 2019 at 1:13 PM Josh Elser  wrote:
My understanding was 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Andrew Purtell
I did a both barrels type response to a suggestion Wellington made that I hope 
communicates the right level of dismay at the prevailing line of thought in 
this thread. 

Let me say I agree hbck 1 was sometimes oversold as a magic tool. 

However if you analyze all of its options and then look to branch 2, where are 
the gaps. In branch 1 there is a command line tool that can be executed by 
operations and first level support. Its options can be described in a runbook 
with cut and paste examples. In branch 2 ... ?

There appears no ready solution for detecting and deploying undeployed 
“missing” regions. 

There appears no ready solution for fixing a failed split or merge or other 
corruption producing a hole or overlap in the region chain. 

There appears no tool capable of rebuilding meta from scratch from HDFS level 
metadata; a last but crucial resort as this is what holds the line against a 
complete and time intensive restore from backup. 

I may have an incorrect impression of some of this. If so that would be a big 
relief. If not these are suggested areas of focus. 

I’m not saying that 2 needs Hbck exactly as it is in 1. However the lack of 
simple recovery tools or actions that can be taken by a non expert guided by a 
runbook means the risk to operations when there is the inevitable problem is 
higher. And I don’t mean theoretical problems. I mean the commonly occurring 
issues Hbck 1 was coded up to address in a mostly automated way, like failed 
splits or failed deployments or simple HDFS level corruptions like loss of meta 
region hfiles. Lacking simple tooling our operations will have to do 
 more complex, labor intensive, and or risky. This factors in to the 
major version upgrade risk analysis. 

What I would advise is an analysis that enumerates all of the risks and 
specific conditions Hbck 1 addresses, then excludes those not relevant for the 
2 code base, then excludes those which have easy and simple tools existing 
right now to solve. What you have left is a list of action items. Then there 
should be an analysis of the new risks in 2 given AMv2s theory of operation, 
for example for each procedure based action if the procedure is always failing 
how can the operator recover the prerequisites for successful completion, and 
provide a simple tool or option for applying a fix or remediation to cluster 
state. 


> On May 30, 2019, at 7:16 AM, Josh Elser  wrote:
> 
> Right, this discuss isn't meant to be implying that any of this exists -- 
> instead, I wanted to make sure we're focused on building tooling which both 
> devs and users will find usable and effective.
> 
> What's your gut-reaction to what I suggested? I think you're saying you see 
> operators having to apply more understanding/insight to fix a "complex 
> problem" as taking on more risk which you'd have to weigh. In other words, 
> anything less than the verbatim "fix these problems" flags you mentioned 
> earlier would require you to do the risk-analysis math if moving to HBase2?
> 
> Thanks for your insights.
> 
>> On 5/29/19 4:45 PM, Andrew Purtell wrote:
>> I have yet to see essential HBCK functions in 1 replaced by anything -
>> documentation, script, hbck2, whatever.
>> Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
>> state? This would be faster than a complete restore from backup. It would
>> be useful and important to offer this option to operators, but not
>> essential, because it could be valid to say if meta is screwed so are you
>> and you have to restore completely from backup. Meta is small, a fraction
>> of total data footprint. Seems a real shame to impose such a high cost when
>> there could be an alternative. I'd have to think for a while about
>> accepting this kind of operational risk when HBase 1 has such tooling.
>> What I am more worried about is this: Do we have a tool or script in HBase
>> 2 that can fix errors in the region chain caused by failed splits, failed
>> merges, or double assignment? It seems not, and the implications for
>> service availability are not good when compared with HBase 1. With HBase 1,
>> hbck is an option. Sure, it has a lot of problematic aspects, but I have
>> seen it recover a cluster's total availability with fairly fast execution.
>> It could be valid, not saying I agree with this point, to clearly document
>> that all aspects of recovery from corrupted metadata is the responsibility
>> of the operator, at least this is full disclosure. We can then weigh the
>> cost and risk associated with this policy when deciding if ever to upgrade.
>>> On Wed, May 29, 2019 at 1:13 PM Josh Elser  wrote:
>>> My understanding was that recreating sweeping "fix it" flags was an
>>> anti-goal of HBCK2, but I'm surprised a grey-beard hasn't come in to
>>> confirm/dispute that :). I could be taking that out of context or my dog
>>> remembers things better than I do.
>>> 
>>> The reasoning behind this line of thinking for HBCK2 is:
>>> 
>>> * Smaller 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Andrew Purtell
Hbck was never magic but at least it was a real tool that often worked. 

This discussion keeps resorting to loosely defined hypotheticals which are not 
real tools available right now. They are not real right now so are all beside 
the point. You have nothing. Worse these suggestions would seem to replace the 
execution of one command line tool with a complex set of actions requiring 
coding or scripting. The breezy suggestion below is the most alarming yet and puts 
the onus on the operator to understand internals at the level of a core 
developer and assume all responsibility for recovery. This kind of thinking is 
disconnected from what you can expect of operators and users who want to deploy 
and use a data store not develop it. Unless the expectation is only core 
developers of HBase will ever run 2 in production, or must purchase a license 
for a commercial distribution which includes such tooling.


On May 30, 2019, at 1:59 AM, Wellington Chevreuil 
 wrote:

>> 
>> It seemed like the table data in HDFS was intact but they lost some meta
>> data
>> (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
>> data.
>> In this case, we can still fix with some combinations of commands today? If
>> so,
>> I would appreciate it if you could suggest the steps to me
> 
> Yeah, there's no single command here, an alternative would be to combine
> merges and bulkloading, for example, say you had regions A, B, C, now meta
> has only A and C, with a hole where you should have B. How about merging A
> and C, then bulkloading B files into the table? Sure, that's much more
> laborious than the magic hbck1 fix, but it's my (same) understanding of the
> hbck2 goals described by Josh earlier. I understand the concerns, and
> Andrew's argument about time to recover operation is a solid one. Maybe
> it's worth revisiting and voting on which former hbck1 options are seen as
> essential by the majority? From this discussion so far, it seems the most
> missed are fixMeta, fixHoles and fixOverlaps?
> 
> 
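To make the "A, C, with a hole where you should have B" case above concrete, here is a minimal Python sketch of detecting such a hole in a region chain. It is only an illustration and not part of hbck1 or hbck2: the function name find_holes, the KeyRange alias and the use of plain strings for row keys are all assumptions made for this example, and it assumes the regions passed in do not overlap. It performs no merge or bulkload; it only reports which key ranges have no region in meta.

```python
from typing import List, Tuple

# Hypothetical representation of a region: (start_key, end_key).
# An empty string stands for the open-ended first start key / last end key.
KeyRange = Tuple[str, str]


def find_holes(regions: List[KeyRange]) -> List[KeyRange]:
    """Return the key ranges that no region in the chain covers.

    Assumes the given regions do not overlap each other.
    """
    holes: List[KeyRange] = []
    covered_up_to = ""   # the start key we expect the next region to have
    reached_end = False  # True once a region with an open end key is seen
    for start, end in sorted(regions):
        if start > covered_up_to:
            # Nothing in the chain covers [covered_up_to, start).
            holes.append((covered_up_to, start))
        if end == "":
            reached_end = True
            break
        covered_up_to = max(covered_up_to, end)
    if not reached_end:
        # The chain stops before the end of the key space.
        holes.append((covered_up_to, ""))
    return holes


if __name__ == "__main__":
    # Regions A and C are present in meta, B (keys "b" to "c") is missing.
    meta = [("", "b"), ("c", "")]
    print(find_holes(meta))
```

An operator-facing report like this would only be the first step of a recipe; deciding whether to merge, bulkload or re-create the missing region is still left to whoever runs it.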
>> On Wed, May 29, 2019 at 23:10, Stack wrote:
>> 
>> Would be good to do a bit of evangelizing that hbck2 is intentionally not
>> meant to be like hbck1. hbck1 gave off the impression that it could fix
>> "all" problems, rebuilding master functionality on the exterior in a
>> contending script. Re-reading the hbck2 home page [1], hoping to find a
>> quote to back Josh's perception, it is plain the text needs to state more
>> forcefully the difference in philosophy.
>> 
>> On missing hbck2 functionality, there is an outstanding task (HBASE-21745)
>> sorting what is needed from hbck1 hangovers so the likes of our Andrew has
>> confidence that should he hit an operational issue, he'll have tooling for
>> repair. Let's be judicious in what we add to hbck2. We've left behind many
>> of the problems hbck1 used to 'fix'. A rebuild of meta should disaster hit
>> makes sense (and is a long-time ask). Fixup for the mess JMS is able to
>> make upgrading from hbase1 to hbase2 makes sense too since this is what our
>> users will be doing (File JIRAs w/ detail on the mess JMS?). Andrew made a
>> list a while back here that needs consideration (HBASE-21745).
>> 
>> S
>> 
>> 1. https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
>> 
>> 
>> 
>> On Wed, May 29, 2019 at 8:55 AM Andrew Purtell 
>> wrote:
>> 
>>> To me this is a succinct specification of minimum functionality for a
>>> recovery tool: using on disk bits, rebuild meta table, with end result a
>>> working cluster that did not miss any data during the reconstruction.
>>> 
>>> Of course focusing on root causes of metadata mismanagement is
>> appropriate
>>> when investigating a specific incident, but this is orthogonal from the
>>> question of whether or not recovery is possible after a bug corrupts
>>> metadata. It is customary for filesystems and databases to ship with a
>> tool
>>> that attempts recovery after corruption, on the (correct, IMHO)
>> assumption
>>> that corruption is inevitable, either due to logic bug, hardware
>> problems,
>>> or operator error.
>>> 
>>> The features of hbck in HBase 1 that have resolved availability problems
>>> where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
>>> In HBaseFsck.java in branch-2 these are all in the unsupported options
>> set.
>>> Because these are all lacking in HBase 2 I will not certify it ready for
>>> production to my employer. If there is some other tool which offers these
>>> recovery options I'm not aware of it nor documentation for it and would
>>> appreciate a pointer if you have one.
>>> 
>>> 
>>> On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
>>> wrote:
>>> 
>>>> Thanks Wellington.
>>>>
>>>> > I guess those can still be fixed with some combinations of commands
>>>> today,
>>>> > such as merge/assign.
>>>>
>>>> Let me explain the situation I faced in the customer's cluster a little
>>> bit
>>>> more.
>>>> It seemed like the table data in HDFS was intact but they lost 

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Josh Elser
Right, this discuss isn't meant to be implying that any of this exists 
-- instead, I wanted to make sure we're focused on building tooling 
which both devs and users will find usable and effective.


What's your gut-reaction to what I suggested? I think you're saying you 
see operators having to apply more understanding/insight to fix a 
"complex problem" as taking on more risk which you'd have to weigh. In 
other words, anything less than the verbatim "fix these problems" flags 
you mentioned earlier would require you to do the risk-analysis math if 
moving to HBase2?


Thanks for your insights.

On 5/29/19 4:45 PM, Andrew Purtell wrote:

I have yet to see essential HBCK functions in 1 replaced by anything -
documentation, script, hbck2, whatever.

Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
state? This would be faster than a complete restore from backup. It would
be useful and important to offer this option to operators, but not
essential, because it could be valid to say if meta is screwed so are you
and you have to restore completely from backup. Meta is small, a fraction
of total data footprint. Seems a real shame to impose such a high cost when
there could be an alternative. I'd have to think for a while about
accepting this kind of operational risk when HBase 1 has such tooling.

What I am more worried about is this: Do we have a tool or script in HBase
2 that can fix errors in the region chain caused by failed splits, failed
merges, or double assignment? It seems not, and the implications for
service availability are not good when compared with HBase 1. With HBase 1,
hbck is an option. Sure, it has a lot of problematic aspects, but I have
seen it recover a cluster's total availability with fairly fast execution.

It could be valid, not saying I agree with this point, to clearly document
that all aspects of recovery from corrupted metadata is the responsibility
of the operator, at least this is full disclosure. We can then weigh the
cost and risk associated with this policy when deciding if ever to upgrade.


On Wed, May 29, 2019 at 1:13 PM Josh Elser  wrote:


My understanding was that recreating sweeping "fix it" flags was an
anti-goal of HBCK2, but I'm surprised a grey-beard hasn't come in to say
confirm/dispute that :). I could be taking that out of context or my dog
remembers things better than I do.

The reasoning behind this line of thinking for HBCK2 is:

* Smaller actions are easier to implement correctly and be well-tested
* The more complex the action, the more likely it is for something we
(as devs) didn't expect to happen which results in a bug.

The "stretch" in my mind is that we can string together small actions to
recreate the bigger ones (the fix* type commands from hbck1), *but*
teach operators to apply knowledge about their cluster instead of
treating hbck like a black box.

For example, if we try to decompose something like fixAssignments into
something like: `for region in $(list non-open regions); do assign
$region; end`. As developers, we don't have to catch every edge case of
_something_ that might be specific to the admin's actual situation (e.g.
what if a table is disabled and we don't want to assign those regions)
and it lets us write better test cases.

Again, this is what I have floating around in my head -- nothing more
than that at present.

On 5/29/19 11:54 AM, Andrew Purtell wrote:

To me this is a succinct specification of minimum functionality for a
recovery tool: using on disk bits, rebuild meta table, with end result a
working cluster that did not miss any data during the reconstruction.

Of course focusing on root causes of metadata mismanagement is appropriate
when investigating a specific incident, but this is orthogonal from the
question of whether or not recovery is possible after a bug corrupts
metadata. It is customary for filesystems and databases to ship with a tool
that attempts recovery after corruption, on the (correct, IMHO) assumption
that corruption is inevitable, either due to logic bug, hardware problems,
or operator error.

The features of hbck in HBase 1 that have resolved availability problems
where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
In HBaseFsck.java in branch-2 these are all in the unsupported options set.

Because these are all lacking in HBase 2 I will not certify it ready for
production to my employer. If there is some other tool which offers these
recovery options I'm not aware of it nor documentation for it and would
appreciate a pointer if you have one.


On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
wrote:


Thanks Wellington.


I guess those can still be fixed with some combinations of commands today,
such as merge/assign.


Let me explain the situation I faced in the customer's cluster a little bit
more.
It seemed like the table data in HDFS was intact but they lost some meta
data
(in hbase:meta) of the table. So I needed to rebuild the meta from HDFS

Re: [DISCUSS] Direction of HBCK2

2019-05-30 Thread Wellington Chevreuil
>
> It seemed like the table data in HDFS was intact but they lost some meta
> data
> (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
> data.
> In this case, we can still fix with some combinations of commands today? If
> so,
> I would appreciate it if you could suggest the steps to me
>

Yeah, there's no single command here. An alternative would be to combine
merges and bulkloading. For example, say you had regions A, B, C, and now meta
has only A and C, with a hole where B should be. How about merging A
and C, then bulkloading B's files into the table? Sure, that's much more
laborious than the magic hbck1 fix, but it matches my understanding of the
hbck2 goals described by Josh earlier. I understand the concerns, and
Andrew's argument about time to recover operation is a solid one. Maybe
it's worth revisiting and voting on which former hbck1 options are seen as
essential by the majority? From this discussion so far, it seems the most
missed are fixMeta, fixHoles and fixOverlaps?
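
To make the A/B/C recipe concrete, here is a minimal sketch in Python. It is only a toy model of the region chain, not HBase or hbck2 code: regions are hypothetical (start_key, end_key) tuples, an empty string stands for an unbounded key, and `find_holes` and `merge` are illustrative helpers invented for this example. The bulkload of B's files is not modeled, and holes before the first or after the last region are not detected.

```python
def find_holes(regions):
    """Return gaps between consecutive regions, sorted by start key.

    regions: list of (start_key, end_key) tuples; "" means unbounded.
    """
    ordered = sorted(regions)
    holes = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        # A gap exists when the previous region ends before the next begins.
        if prev_end != "" and prev_end < next_start:
            holes.append((prev_end, next_start))
    return holes


def merge(left, right):
    """Model a merge: one region from left's start key to right's end key."""
    return (left[0], right[1])


# Meta knows only A and C; B (keys "b" to "c") is missing.
A, C = ("", "b"), ("c", "")
meta = [A, C]
holes_before = find_holes(meta)

# Merging A and C yields a single region covering the whole key space,
# so the chain has no hole left; B's data would then be bulkloaded.
meta = [merge(A, C)]
holes_after = find_holes(meta)

print(holes_before, holes_after)
```

The point of the model is that the merge closes the hole in meta without touching B's files, which is why the bulkload is still needed afterwards to bring B's data back into the table.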


On Wed, May 29, 2019 at 23:10, Stack wrote:

> Would be good to do a bit of evangelizing that hbck2 is intentionally not
> meant to be like hbck1. hbck1 gave off the impression that it could fix
> "all" problems, rebuilding master functionality on the exterior in a
> contending script. Re-reading the hbck2 home page [1], hoping to find a
> quote to back Josh's perception, it is plain the text needs to state more
> forcefully the difference in philosophy.
>
> On missing hbck2 functionality, there is an outstanding task (HBASE-21745)
> sorting what is needed from hbck1 hangovers so the likes of our Andrew has
> confidence that should he hit an operational issue, he'll have tooling for
> repair. Let's be judicious in what we add to hbck2. We've left behind many
> of the problems hbck1 used to 'fix'. A rebuild of meta should disaster hit
> makes sense (and is a long-time ask). Fixup for the mess JMS is able to
> make upgrading from hbase1 to hbase2 makes sense too since this is what our
> users will be doing (File JIRAs w/ detail on the mess JMS?). Andrew made a
> list a while back here that needs consideration (HBASE-21745).
>
> S
>
> 1. https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2
>
>
>
> On Wed, May 29, 2019 at 8:55 AM Andrew Purtell 
> wrote:
>
> > To me this is a succinct specification of minimum functionality for a
> > recovery tool: using on disk bits, rebuild meta table, with end result a
> > working cluster that did not miss any data during the reconstruction.
> >
> > Of course focusing on root causes of metadata mismanagement is
> appropriate
> > when investigating a specific incident, but this is orthogonal from the
> > question of whether or not recovery is possible after a bug corrupts
> > metadata. It is customary for filesystems and databases to ship with a
> tool
> > that attempts recovery after corruption, on the (correct, IMHO)
> assumption
> > that corruption is inevitable, either due to logic bug, hardware
> problems,
> > or operator error.
> >
> > The features of hbck in HBase 1 that have resolved availability problems
> > where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
> > In HBaseFsck.java in branch-2 these are all in the unsupported options
> set.
> > Because these are all lacking in HBase 2 I will not certify it ready for
> > production to my employer. If there is some other tool which offers these
> > recovery options I'm not aware of it nor documentation for it and would
> > appreciate a pointer if you have one.
> >
> >
> > On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
> > wrote:
> >
> > > Thanks Wellington.
> > >
> > > > I guess those can still be fixed with some combinations of commands
> > > today,
> > > > such as merge/assign.
> > >
> > > Let me explain the situation I faced in the customer's cluster a little
> > bit
> > > more.
> > > It seemed like the table data in HDFS was intact but they lost some
> meta
> > > data
> > > (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
> > > data.
> > > In this case, we can still fix with some combinations of commands
> today?
> > If
> > > so,
> > > I would appreciate it if you could suggest the steps to me.
> > >
> > > > And focus on fixing the main root cause of such problems, as a mean
> to
> > > > soften the need of use such commands.
> > >
> > > Yes, correct. Actually I usually do that. But I didn't do that in that
> > > case..
> > >
> > >
> > > On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> > > wellington.chevre...@gmail.com> wrote:
> > >
> > > > Thanks Toshihiro! I guess those can still be fixed with some
> > combinations
> > > > of commands today, such as merge/assign. Of course, it requires some
> > > extra
> > > > scripting and log reading on cases where many regions are in an
> > > > inconsistent state, maybe we should work on provide a one liner
> command
> > > > that relies on the current existing ones. And focus on fixing the
> main
> > > root

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Stack
Would be good to do a bit of evangelizing that hbck2 is intentionally not
meant to be like hbck1. hbck1 gave off the impression that it could fix
"all" problems, rebuilding master functionality on the exterior in a
contending script. Re-reading the hbck2 home page [1], hoping to find a
quote to back Josh's perception, it is plain the text needs to state more
forcefully the difference in philosophy.

On missing hbck2 functionality, there is an outstanding task (HBASE-21745)
sorting what is needed from hbck1 hangovers so the likes of our Andrew have
confidence that should he hit an operational issue, he'll have tooling for
repair. Let's be judicious in what we add to hbck2. We've left behind many
of the problems hbck1 used to 'fix'. A rebuild of meta should disaster hit
makes sense (and is a long-time ask). Fixup for the mess JMS is able to
make upgrading from hbase1 to hbase2 makes sense too since this is what our
users will be doing (File JIRAs w/ detail on the mess JMS?). Andrew made a
list a while back here that needs consideration (HBASE-21745).

S

1. https://github.com/apache/hbase-operator-tools/tree/master/hbase-hbck2



On Wed, May 29, 2019 at 8:55 AM Andrew Purtell  wrote:

> To me this is a succinct specification of minimum functionality for a
> recovery tool: using on disk bits, rebuild meta table, with end result a
> working cluster that did not miss any data during the reconstruction.
>
> Of course focusing on root causes of metadata mismanagement is appropriate
> when investigating a specific incident, but this is orthogonal from the
> question of whether or not recovery is possible after a bug corrupts
> metadata. It is customary for filesystems and databases to ship with a tool
> that attempts recovery after corruption, on the (correct, IMHO) assumption
> that corruption is inevitable, either due to logic bug, hardware problems,
> or operator error.
>
> The features of hbck in HBase 1 that have resolved availability problems
> where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
> In HBaseFsck.java in branch-2 these are all in the unsupported options set.
> Because these are all lacking in HBase 2 I will not certify it ready for
> production to my employer. If there is some other tool which offers these
> recovery options I'm not aware of it nor documentation for it and would
> appreciate a pointer if you have one.
>
>
> On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
> wrote:
>
> > Thanks Wellington.
> >
> > > I guess those can still be fixed with some combinations of commands
> > today,
> > > such as merge/assign.
> >
> > Let me explain the situation I faced in the customer's cluster a little
> bit
> > more.
> > It seemed like the table data in HDFS was intact but they lost some meta
> > data
> > (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
> > data.
> > In this case, we can still fix with some combinations of commands today?
> If
> > so,
> > I would appreciate it if you could suggest the steps to me.
> >
> > > And focus on fixing the main root cause of such problems, as a mean to
> > > soften the need of use such commands.
> >
> > Yes, correct. Actually I usually do that. But I didn't do that in that
> > case..
> >
> >
> > On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> > wellington.chevre...@gmail.com> wrote:
> >
> > > Thanks Toshihiro! I guess those can still be fixed with some
> combinations
> > > of commands today, such as merge/assign. Of course, it requires some
> > extra
> > > scripting and log reading on cases where many regions are in an
> > > inconsistent state, maybe we should work on provide a one liner command
> > > that relies on the current existing ones. And focus on fixing the main
> > root
> > > cause of such problems, as a mean to soften the need of use such
> > commands.
> > >
> > > I'm not really a fan of offlinemetarepair, nor hbck1 fix
> holes/overlaps,
> > > would rather not have those back. Sure those are easy and convenient to
> > > trigger, but hbck1 reports are sometimes misleading (for instance, it
> > > reports holes when region(s) on the chain is/are simply not online),
> and
> > > that, combined with availability of such heavy hammers had led
> > > unexperienced operators to fall into running it and getting into a
> worse
> > > state.
> > >
> > > On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org>
> > > wrote:
> > >
> > > > Hi Wellington,
> > > >
> > > > I saw table holes in a customer's cluster actually, and I just fixed
> > the
> > > > issues
> > > > by the workaround I mentioned in HBASE-21665
> > > >  and I didn't dig
> > the
> > > > reason
> > > > why the table holes happened at that time because the customer didn't
> > > want.
> > > >
> > > > However, IMO, whatever the reason I think we should have a direct way
> > to
> > > > fix
> > > > holes and overlaps.
> > > >
> > > > On Wed, May 29, 2019 at 4:57 AM 

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Andrew Purtell
I have yet to see essential hbck functions from HBase 1 replaced by anything:
documentation, script, hbck2, whatever.

Do we have a tool or script in HBase 2 that can rebuild meta from HDFS
state? This would be faster than a complete restore from backup. It would
be useful and important to offer this option to operators, but not
essential, because it could be valid to say if meta is screwed so are you
and you have to restore completely from backup. Meta is small, a fraction
of total data footprint. Seems a real shame to impose such a high cost when
there could be an alternative. I'd have to think for a while about
accepting this kind of operational risk when HBase 1 has such tooling.

What I am more worried about is this: Do we have a tool or script in HBase
2 that can fix errors in the region chain caused by failed splits, failed
merges, or double assignment? It seems not, and the implications for
service availability are not good when compared with HBase 1. With HBase 1,
hbck is an option. Sure, it has a lot of problematic aspects, but I have
seen it recover a cluster's total availability with fairly fast execution.
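
As an illustration of what "errors in the region chain" means here, the following is a minimal sketch in Python of a chain check. It is a toy model and not hbck code: regions are hypothetical (start_key, end_key) tuples, an empty end key means "to the end of the table", and `check_region_chain` is a name made up for this example. It only compares neighbouring regions after sorting, so it would not report every possible inconsistency.

```python
def check_region_chain(regions):
    """Report holes and overlaps between neighbouring regions.

    regions: list of (start_key, end_key) tuples; an empty end key
    means the region runs to the end of the table.
    Returns a list of (kind, key_a, key_b) with kind "hole" or "overlap".
    """
    ordered = sorted(regions)
    findings = []
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if prev_end == "" or prev_end > next_start:
            # The previous region extends past the start of the next one.
            findings.append(("overlap", next_start, prev_end))
        elif prev_end < next_start:
            # No region covers the keys between prev_end and next_start.
            findings.append(("hole", prev_end, next_start))
    return findings


# Hypothetical chain: the first two regions overlap on keys "e" to "g",
# and nothing covers keys "m" to "p".
chain = [("", "g"), ("e", "m"), ("p", "")]
findings = check_region_chain(chain)
for kind, a, b in findings:
    print(kind, repr(a), repr(b))
```

Detecting these conditions is the easy part; the discussion in this thread is about which repair actions should be offered once they are found.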

It could be valid, not saying I agree with this point, to clearly document
that all aspects of recovery from corrupted metadata are the responsibility
of the operator; at least this is full disclosure. We can then weigh the
cost and risk associated with this policy when deciding if ever to upgrade.


On Wed, May 29, 2019 at 1:13 PM Josh Elser  wrote:

> My understanding was that recreating sweeping "fix it" flags was an
> anti-goal of HBCK2, but I'm surprised a grey-beard hasn't come in to say
> confirm/dispute that :). I could be taking that out of context or my dog
> remembers things better than I do.
>
> The reasoning behind this line of thinking for HBCK2 is:
>
> * Smaller actions are easier to implement correctly and be well-tested
> * The more complex the action, the more likely it is for something we
> (as devs) didn't expect to happen which results in a bug.
>
> The "stretch" in my mind is that we can string together small actions to
> recreate the bigger ones (the fix* type commands from hbck1), *but*
> teach operators to apply knowledge about their cluster instead of
> treating hbck like a black box.
>
> For example, if we try to decompose something like fixAssignments into
> something like: `for region in $(list non-open regions); do assign
> $region; end`. As developers, we don't have to catch every edge case of
> _something_ that might be specific to the admin's actual situation (e.g.
> what if a table is disabled and we don't want to assign those regions)
> and it lets us write better test cases.
>
> Again, this is what I have floating around in my head -- nothing more
> than that at present.
>
> On 5/29/19 11:54 AM, Andrew Purtell wrote:
> > To me this is a succinct specification of minimum functionality for a
> > recovery tool: using on disk bits, rebuild meta table, with end result a
> > working cluster that did not miss any data during the reconstruction.
> >
> > Of course focusing on root causes of metadata mismanagement is
> appropriate
> > when investigating a specific incident, but this is orthogonal from the
> > question of whether or not recovery is possible after a bug corrupts
> > metadata. It is customary for filesystems and databases to ship with a
> tool
> > that attempts recovery after corruption, on the (correct, IMHO)
> assumption
> > that corruption is inevitable, either due to logic bug, hardware
> problems,
> > or operator error.
> >
> > The features of hbck in HBase 1 that have resolved availability problems
> > where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
> > In HBaseFsck.java in branch-2 these are all in the unsupported options
> set.
> > Because these are all lacking in HBase 2 I will not certify it ready for
> > production to my employer. If there is some other tool which offers these
> > recovery options I'm not aware of it nor documentation for it and would
> > appreciate a pointer if you have one.
> >
> >
> > On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
> > wrote:
> >
> >> Thanks Wellington.
> >>
> >>> I guess those can still be fixed with some combinations of commands
> >> today,
> >>> such as merge/assign.
> >>
> >> Let me explain the situation I faced in the customer's cluster a little
> bit
> >> more.
> >> It seemed like the table data in HDFS was intact but they lost some meta
> >> data
> >> (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
> >> data.
> >> In this case, we can still fix with some combinations of commands
> today? If
> >> so,
> >> I would appreciate it if you could suggest the steps to me.
> >>
> >>> And focus on fixing the main root cause of such problems, as a mean to
> >>> soften the need of use such commands.
> >>
> >> Yes, correct. Actually I usually do that. But I didn't do that in that
> >> case..
> >>
> >>
> >> On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> >> 

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Josh Elser
My understanding was that recreating sweeping "fix it" flags was an
anti-goal of HBCK2, but I'm surprised a grey-beard hasn't come in to
confirm/dispute that :). I could be taking that out of context or my dog
remembers things better than I do.


The reasoning behind this line of thinking for HBCK2 is:

* Smaller actions are easier to implement correctly and to test well
* The more complex the action, the more likely it is that something we
(as devs) didn't expect happens, which results in a bug.


The "stretch" in my mind is that we can string together small actions to 
recreate the bigger ones (the fix* type commands from hbck1), *but* 
teach operators to apply knowledge about their cluster instead of 
treating hbck like a black box.


For example, say we decompose something like fixAssignments into
something like: `for region in $(list non-open regions); do assign
$region; done`. As developers, we then don't have to catch every edge case of
_something_ that might be specific to the admin's actual situation (e.g.
what if a table is disabled and we don't want to assign those regions),
and it lets us write better test cases.
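
The decomposition above can be sketched roughly as follows. This is a Python model for illustration only, not hbck2 code: the `Region` record, the state strings, and `regions_to_assign` are all hypothetical names, and the final loop just prints the assign step instead of calling any real tool. The disabled-table filter is the kind of cluster-specific decision the operator supplies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Region:
    encoded_name: str  # hypothetical region identifier
    table: str
    state: str         # e.g. "OPEN", "CLOSED", "OFFLINE"


def regions_to_assign(regions, disabled_tables):
    """Pick regions that are not open and whose table is not disabled."""
    return [
        r.encoded_name
        for r in regions
        if r.state != "OPEN" and r.table not in disabled_tables
    ]


regions = [
    Region("r1", "orders", "OPEN"),
    Region("r2", "orders", "CLOSED"),
    Region("r3", "archive", "CLOSED"),  # table disabled on purpose
]

# The operator knows "archive" is disabled, so r3 is left alone.
to_assign = regions_to_assign(regions, disabled_tables={"archive"})
for name in to_assign:
    print(f"assign {name}")
```

Keeping the selection step separate from the assign step is what makes each piece small enough to test on its own.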


Again, this is what I have floating around in my head -- nothing more 
than that at present.


On 5/29/19 11:54 AM, Andrew Purtell wrote:

To me this is a succinct specification of minimum functionality for a
recovery tool: using on disk bits, rebuild meta table, with end result a
working cluster that did not miss any data during the reconstruction.

Of course focusing on root causes of metadata mismanagement is appropriate
when investigating a specific incident, but this is orthogonal from the
question of whether or not recovery is possible after a bug corrupts
metadata. It is customary for filesystems and databases to ship with a tool
that attempts recovery after corruption, on the (correct, IMHO) assumption
that corruption is inevitable, either due to logic bug, hardware problems,
or operator error.

The features of hbck in HBase 1 that have resolved availability problems
where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
In HBaseFsck.java in branch-2 these are all in the unsupported options set.
Because these are all lacking in HBase 2 I will not certify it ready for
production to my employer. If there is some other tool which offers these
recovery options I'm not aware of it nor documentation for it and would
appreciate a pointer if you have one.


On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
wrote:


Thanks Wellington.


I guess those can still be fixed with some combinations of commands today,
such as merge/assign.


Let me explain the situation I faced in the customer's cluster a little bit
more.
It seemed like the table data in HDFS was intact but they lost some meta
data
(in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
data.
In this case, we can still fix with some combinations of commands today? If
so,
I would appreciate it if you could suggest the steps to me.


And focus on fixing the main root cause of such problems, as a mean to
soften the need of use such commands.


Yes, correct. Actually I usually do that. But I didn't do that in that
case..


On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:


Thanks Toshihiro! I guess those can still be fixed with some combinations
of commands today, such as merge/assign. Of course, it requires some extra
scripting and log reading on cases where many regions are in an
inconsistent state, maybe we should work on provide a one liner command
that relies on the current existing ones. And focus on fixing the main root
cause of such problems, as a mean to soften the need of use such commands.


I'm not really a fan of offlinemetarepair, nor hbck1 fix holes/overlaps,
would rather not have those back. Sure those are easy and convenient to
trigger, but hbck1 reports are sometimes misleading (for instance, it
reports holes when region(s) on the chain is/are simply not online), and
that, combined with availability of such heavy hammers had led
unexperienced operators to fall into running it and getting into a worse
state.

On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org> wrote:


Hi Wellington,

I saw table holes in a customer's cluster actually, and I just fixed the issues
by the workaround I mentioned in HBASE-21665 and I didn't dig the reason
why the table holes happened at that time because the customer didn't want.

However, IMO, whatever the reason I think we should have a direct way to fix
holes and overlaps.

On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:


So JMS, Toshihiro, seems like upgrading from some 1.x to 2.x consistently
triggers this problem? Do you guys know if there are any bug jiras open
that would cover these scenarios? If not, and if you guys have enough
resources

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Andrew Purtell
To me this is a succinct specification of minimum functionality for a
recovery tool: using on disk bits, rebuild meta table, with end result a
working cluster that did not miss any data during the reconstruction.

Of course focusing on root causes of metadata mismanagement is appropriate
when investigating a specific incident, but this is orthogonal to the
question of whether or not recovery is possible after a bug corrupts
metadata. It is customary for filesystems and databases to ship with a tool
that attempts recovery after corruption, on the (correct, IMHO) assumption
that corruption is inevitable, whether due to a logic bug, hardware problems,
or operator error.

The features of hbck in HBase 1 that have resolved availability problems
where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps.
In HBaseFsck.java in branch-2 these are all in the unsupported options set.
Because these are all lacking in HBase 2 I will not certify it ready for
production to my employer. If there is some other tool which offers these
recovery options I'm not aware of it nor documentation for it and would
appreciate a pointer if you have one.


On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki 
wrote:

> Thanks Wellington.
>
> > I guess those can still be fixed with some combinations of commands
> today,
> > such as merge/assign.
>
> Let me explain the situation I faced in the customer's cluster a little bit
> more.
> It seemed like the table data in HDFS was intact but they lost some meta
> data
> (in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
> data.
> In this case, we can still fix with some combinations of commands today? If
> so,
> I would appreciate it if you could suggest the steps to me.
>
> > And focus on fixing the main root cause of such problems, as a mean to
> > soften the need of use such commands.
>
> Yes, correct. Actually I usually do that. But I didn't do that in that
> case..
>
>
> On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
> wellington.chevre...@gmail.com> wrote:
>
> > Thanks Toshihiro! I guess those can still be fixed with some combinations
> > of commands today, such as merge/assign. Of course, it requires some
> extra
> > scripting and log reading on cases where many regions are in an
> > inconsistent state, maybe we should work on provide a one liner command
> > that relies on the current existing ones. And focus on fixing the main
> root
> > cause of such problems, as a mean to soften the need of use such
> commands.
> >
> > I'm not really a fan of offlinemetarepair, nor hbck1 fix holes/overlaps,
> > would rather not have those back. Sure those are easy and convenient to
> > trigger, but hbck1 reports are sometimes misleading (for instance, it
> > reports holes when region(s) on the chain is/are simply not online), and
> > that, combined with availability of such heavy hammers had led
> > unexperienced operators to fall into running it and getting into a worse
> > state.
> >
> > On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org>
> > wrote:
> >
> > > Hi Wellington,
> > >
> > > I saw table holes in a customer's cluster actually, and I just fixed
> the
> > > issues
> > > by the workaround I mentioned in HBASE-21665
> > >  and I didn't dig
> the
> > > reason
> > > why the table holes happened at that time because the customer didn't
> > want.
> > >
> > > However, IMO, whatever the reason I think we should have a direct way
> to
> > > fix
> > > holes and overlaps.
> > >
> > > On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
> > > wellington.chevre...@gmail.com> wrote:
> > >
> > > > So JMS, Toshihiro, seems like upgrading from some 1.x to 2.x
> > consistently
> > > > triggers this problem? Do you guys know if there are any bug jiras
> open
> > > > that would cover these scenarios? If not, and if you guys have enough
> > > > resources for investigating it, maybe worth open a specific jira?
> > > >
> > > > > On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <
> > > > > jean-m...@spaggiari.org> wrote:
> > > >
> > > > > Personnaly, when I tried to upgrade from 1.4.x to 2.2.x I end up
> in a
> > > > > situation where my meta was empty and had to get it repaired, but
> > > lacked
> > > > > OfflineMetaRepair for 2.2.x so I just had to delete all my tables,
> > get
> > > a
> > > > > brand new installation, recreate the tables and bulkload back the
> > data
> > > > into
> > > > > them. Would have been happy to have a OfflineMetaRepair.
> > > > >
> > > > > But it's more like an experimental cluster than a production one...
> > > > >
> > > > > JMS
> > > > >
> > > > > > On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <
> > > > > > wellington.chevre...@gmail.com> wrote:
> > > > >
> > > > > > Interesting, I haven't seen any cases where OfflineMetaRepair was
> > > > really
> > > > > > required, among our customer base (running cdh6.1.x/hbase2.1.1,
> > > > > > cdh6.2/hbase2.1.2). 

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Toshihiro Suzuki
Thanks Wellington.

> I guess those can still be fixed with some combinations of commands today,
> such as merge/assign.

Let me explain the situation I faced in the customer's cluster a little bit
more.
It seemed like the table data in HDFS was intact but they lost some metadata
(in hbase:meta) of the table. So I needed to rebuild the meta from HDFS
data.
In this case, can we still fix it with some combination of commands today? If
so, I would appreciate it if you could suggest the steps to me.

> And focus on fixing the main root cause of such problems, as a mean to
> soften the need of use such commands.

Yes, correct. Actually I usually do that. But I didn't do that in that
case.


On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <
wellington.chevre...@gmail.com> wrote:

> Thanks Toshihiro! I guess those can still be fixed with some combinations
> of commands today, such as merge/assign. Of course, it requires some extra
> scripting and log reading on cases where many regions are in an
> inconsistent state, maybe we should work on provide a one liner command
> that relies on the current existing ones. And focus on fixing the main root
> cause of such problems, as a mean to soften the need of use such commands.
>
> I'm not really a fan of offlinemetarepair, nor hbck1 fix holes/overlaps,
> would rather not have those back. Sure those are easy and convenient to
> trigger, but hbck1 reports are sometimes misleading (for instance, it
> reports holes when region(s) on the chain is/are simply not online), and
> that, combined with availability of such heavy hammers had led
> unexperienced operators to fall into running it and getting into a worse
> state.
>
> On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki
> wrote:
>
> > Hi Wellington,
> >
> > I saw table holes in a customer's cluster actually, and I just fixed the
> > issues
> > by the workaround I mentioned in HBASE-21665
> >  and I didn't dig the
> > reason
> > why the table holes happened at that time because the customer didn't
> want.
> >
> > However, IMO, whatever the reason I think we should have a direct way to
> > fix
> > holes and overlaps.
> >
> > On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <
> > wellington.chevre...@gmail.com> wrote:
> >
> > > So JMS, Toshihiro, seems like upgrading from some 1.x to 2.x
> consistently
> > > triggers this problem? Do you guys know if there are any bug jiras open
> > > that would cover these scenarios? If not, and if you guys have enough
> > > resources for investigating it, maybe worth open a specific jira?
> > >
> > > > On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <
> > > > jean-m...@spaggiari.org> wrote:
> > >
> > > > Personnaly, when I tried to upgrade from 1.4.x to 2.2.x I end up in a
> > > > situation where my meta was empty and had to get it repaired, but
> > lacked
> > > > OfflineMetaRepair for 2.2.x so I just had to delete all my tables,
> get
> > a
> > > > brand new installation, recreate the tables and bulkload back the
> data
> > > into
> > > > them. Would have been happy to have a OfflineMetaRepair.
> > > >
> > > > But it's more like an experimental cluster than a production one...
> > > >
> > > > JMS
> > > >
> > > > > On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <
> > > > > wellington.chevre...@gmail.com> wrote:
> > > >
> > > > > Interesting, I haven't seen any cases where OfflineMetaRepair was
> > > really
> > > > > required, among our customer base (running cdh6.1.x/hbase2.1.1,
> > > > > cdh6.2/hbase2.1.2). Majority of RITs issue I had came with on hbase
> > 2.x
> > > > > were related to APs/SCPs failures, most of which could be sorted
> with
> > > > hbck2
> > > > > commands available by then (in some cases, required some CLI
> > scripting
> > > to
> > > > > build up a "bulk" assign command).
> > > > >
> > > > > Em qua, 29 de mai de 2019 às 00:55, Toshihiro Suzuki <
> > > > brfrn...@apache.org>
> > > > > escreveu:
> > > > >
> > > > > > Hi Josh,
> > > > > >
> > > > > > Thank you for the explanation. I agree with the direction for
> > HBCK2.
> > > > > >
> > > > > > The problem I wanted to tell you in the Jira is that until we
> > > implement
> > > > > the
> > > > > > features
> > > > > > you mentioned, we don't have any direct way how to fix holes and
> > > > > overlaps.
> > > > > > The holes and overlaps can be created by bugs or operation
> errors,
> > > so I
> > > > > > think we
> > > > > > should be able to fix these issues.
> > > > > >
> > > > > > I thought OfflineMetaRepair could be a workaround for the issues
> > > until
> > > > we
> > > > > > implement
> > > > > > the features of HBCK2.
> > > > > >
> > > > > > Regards,
> > > > > > Toshi
> > > > > >
> > > > > >
> > > > > > On Tue, May 28, 2019 at 9:12 AM Josh Elser 
> > > wrote:
> > > > > >
> > > > > > > Context: https://issues.apache.org/jira/browse/HBASE-21665
> > > > > > >
> > > > > > > I left a comment on the above issue about what I thought good
> > > 

Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Wellington Chevreuil
Thanks Toshihiro! I guess those can still be fixed with some combination
of today's commands, such as merge/assign. Of course, that requires some
extra scripting and log reading in cases where many regions are in an
inconsistent state, so maybe we should work on providing a one-line command
that relies on the existing ones, and focus on fixing the root cause of
such problems as a means of reducing the need for such commands.

I'm not really a fan of OfflineMetaRepair, nor of the hbck1 fixes for
holes/overlaps, and would rather not have those back. Sure, those are easy
and convenient to trigger, but hbck1 reports are sometimes misleading (for
instance, it reports holes when region(s) on the chain is/are simply not
online), and that, combined with the availability of such heavy hammers,
has led inexperienced operators to run it and end up in a worse state.


Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Toshihiro Suzuki
Hi Wellington,

I actually saw table holes in a customer's cluster, and I just fixed the
issues with the workaround I mentioned in HBASE-21665. I didn't dig into
the reason why the table holes happened at that time because the customer
didn't want me to.

However, IMO, whatever the reason, I think we should have a direct way to
fix holes and overlaps.


Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Wellington Chevreuil
So JMS, Toshihiro, it seems like upgrading from some 1.x to 2.x consistently
triggers this problem? Do you guys know if there are any bug jiras open
that would cover these scenarios? If not, and if you guys have enough
resources to investigate it, maybe it's worth opening a specific jira?


Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Jean-Marc Spaggiari
Personally, when I tried to upgrade from 1.4.x to 2.2.x I ended up in a
situation where my meta was empty and had to get it repaired, but I lacked
OfflineMetaRepair for 2.2.x, so I just had to delete all my tables, get a
brand new installation, recreate the tables and bulkload the data back into
them. I would have been happy to have an OfflineMetaRepair.

But it's more like an experimental cluster than a production one...

JMS


Re: [DISCUSS] Direction of HBCK2

2019-05-29 Thread Wellington Chevreuil
Interesting, I haven't seen any cases where OfflineMetaRepair was really
required among our customer base (running cdh6.1.x/hbase2.1.1,
cdh6.2/hbase2.1.2). The majority of the RIT issues I have come across on
hbase 2.x were related to AP/SCP failures, most of which could be sorted out
with the hbck2 commands available by then (in some cases this required some
CLI scripting to build up a "bulk" assign command).
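
(Illustrative sketch, not part of the original message.) The kind of CLI
scripting described above might look roughly like the following. It assumes
the HBCK2 "assigns" command, invoked through "hbase hbck -j <jar>" and given
one or more encoded region names; the jar path and the region names are
placeholders, and the function name is hypothetical. Check the HBCK2 README
for the version you run before relying on it.

```python
import shlex
from typing import Iterable, List


def build_bulk_assign_command(
    encoded_regions: Iterable[str],
    hbck2_jar: str = "/path/to/hbase-hbck2.jar",  # placeholder path (assumption)
) -> List[str]:
    """Build one 'assigns' invocation covering many encoded region names."""
    # Trim whitespace and drop blank entries (e.g. from a file read line by line).
    names = [name.strip() for name in encoded_regions if name.strip()]
    # Remove duplicates while keeping the original order.
    unique = list(dict.fromkeys(names))
    if not unique:
        raise ValueError("no encoded region names given")
    return ["hbase", "hbck", "-j", hbck2_jar, "assigns", *unique]


if __name__ == "__main__":
    # Hypothetical encoded region names, e.g. collected from master logs.
    regions = ["de00010733901a05f5a2a3a382e27dd4", "1588230740"]
    # Print the command for review instead of executing it.
    print(shlex.join(build_bulk_assign_command(regions)))
```

Printing the command rather than running it keeps the operator in the loop,
which fits the thread's concern about heavy hammers being run blindly.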


Re: [DISCUSS] Direction of HBCK2

2019-05-28 Thread Toshihiro Suzuki
Hi Josh,

Thank you for the explanation. I agree with the direction for HBCK2.

The problem I wanted to tell you about in the Jira is that, until we
implement the features you mentioned, we don't have any direct way to fix
holes and overlaps. Holes and overlaps can be created by bugs or operational
errors, so I think we should be able to fix these issues.

I thought OfflineMetaRepair could be a workaround for these issues until we
implement those HBCK2 features.

Regards,
Toshi


On Tue, May 28, 2019 at 9:12 AM Josh Elser  wrote:

> Context: https://issues.apache.org/jira/browse/HBASE-21665
>
> I left a comment on the above issue about what I thought good things to
> build into HBCK2 would be -- a focus on specific "primitive" operations
> that an admin/operator could use to help repair an otherwise broken
> HBase installation. Some examples I had in my head were:
>
> * Create an empty region (to plug a hole)
> * Report holes in a region chain
>
> In my head, the difference for HBCK2 was that we want to give folks the
> tools to fix their cluster, but we did not want to own the "just fix
> everything" kind of tool that HBCK1 had become. The problem with HBCK1
> was that it was often difficult/problematic for us to know how to
> correctly fix a problem (the same problem could be corrected in
> different ways).
>
> Andrew had some confusion about this, so I'm not sure if I'm off-base or
> if we're all in agreement on direction and we just need to do a better
> job documenting things. Thanks for keeping me honest either way :)
>
> And just in case it doesn't go without saying, HBCK2 would be something
> that helps fix a system, while we want to always understand the root
> cause of how/why we got into a situation where we needed HBCK2 and also
> address that.
>
> - Josh
>
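
(Illustrative sketch, not part of the original thread.) The "report holes in
a region chain" primitive mentioned above can be pictured as a scan over
region boundaries sorted by start key. The sketch below is a generic
interval check and not HBCK2's implementation; the function name and the
(start_key, end_key) tuple layout are assumptions, with an empty key standing
for the open start or end of the table.

```python
from typing import List, Tuple

Region = Tuple[bytes, bytes]  # (start_key, end_key); b"" means open-ended


def find_holes_and_overlaps(
    regions: List[Region],
) -> Tuple[List[Region], List[Region]]:
    """Return (holes, overlaps) as key ranges for one table's region chain."""
    holes: List[Region] = []
    overlaps: List[Region] = []
    ordered = sorted(regions, key=lambda r: r[0])
    if not ordered:
        # No regions at all: the whole key space is one hole.
        return [(b"", b"")], []

    first_start, covered_end = ordered[0]
    if first_start != b"":
        # The chain does not begin at the start of the table.
        holes.append((b"", first_start))

    for start, end in ordered[1:]:
        if covered_end == b"":
            # Earlier regions already reach the end of the table.
            overlaps.append((start, end))
            continue
        if start > covered_end:
            holes.append((covered_end, start))
        elif start < covered_end:
            # The overlap stops at whichever boundary comes first.
            stop = end if end != b"" and end < covered_end else covered_end
            overlaps.append((start, stop))
        if end == b"" or end > covered_end:
            covered_end = end

    if covered_end != b"":
        # The chain stops before the end of the table.
        holes.append((covered_end, b""))
    return holes, overlaps


if __name__ == "__main__":
    chain = [(b"", b"b"), (b"c", b"f"), (b"e", b"")]
    print(find_holes_and_overlaps(chain))
```

A check like this only reports what the boundaries say. As noted in the
thread, a reported hole can also mean a region is simply not online, so
choosing the repair is still left to the operator.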