I did a both-barrels-type response to a suggestion Wellington made, which I hope communicates the right level of dismay at the prevailing line of thought in this thread.
Let me say I agree hbck1 was sometimes oversold as a magic tool. However, if you analyze all of its options and then look to branch-2, where are the gaps? In branch-1 there is a command-line tool that can be executed by operations and first-level support. Its options can be described in a runbook with cut-and-paste examples. In branch-2 ... ?

There appears to be no ready solution for detecting and deploying undeployed "missing" regions. There appears to be no ready solution for fixing a failed split or merge or other corruption producing a hole or overlap in the region chain. There appears to be no tool capable of rebuilding meta from scratch from HDFS-level metadata; a last but crucial resort, as this is what holds the line against a complete and time-intensive restore from backup. I may have an incorrect impression of some of this. If so, that would be a big relief. If not, these are suggested areas of focus.

I'm not saying that 2 needs hbck exactly as it is in 1. However, the lack of simple recovery tools or actions that can be taken by a non-expert guided by a runbook means the risk to operations when there is the inevitable problem is higher. And I don't mean theoretical problems. I mean the commonly occurring issues hbck1 was coded up to address in a mostly automated way, like failed splits or failed deployments or simple HDFS-level corruptions like loss of meta region hfiles. Lacking simple tooling, our operations will have to do <something> more complex, labor-intensive, and/or risky. This factors into the major version upgrade risk analysis.

What I would advise is an analysis that enumerates all of the risks and specific conditions hbck1 addresses, then excludes those not relevant for the 2 code base, then excludes those which have easy and simple tools existing right now to solve. What you have left is a list of action items.
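For reference, the kind of cut-and-paste runbook entry branch-1 supports is short. A sketch using hbck1's documented flags (branch-1 only; these sit in the unsupported options set on branch-2). The commands are echoed here rather than executed; on a live branch-1 cluster an operator would drop the leading "echo":

```shell
# hbck1 runbook sketch (branch-1 only). Commands are printed, not run;
# drop the leading "echo" to execute against a live cluster.
echo hbase hbck                                   # read-only inconsistency report
echo hbase hbck -fixAssignments                   # repair unassigned/doubly assigned regions
echo hbase hbck -fixMeta                          # reconcile hbase:meta with HDFS state
echo hbase hbck -fixHdfsHoles -fixHdfsOverlaps    # patch holes/overlaps in the region chain
```

This is exactly the shape of thing first-level support can run from a document, which is the gap being argued here.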
Then there should be an analysis of the new risks in 2 given AMv2's theory of operation. For example, for each procedure-based action: if the procedure is always failing, how can the operator recover the prerequisites for successful completion? Provide a simple tool or option for applying a fix or remediation to cluster state.

> On May 30, 2019, at 7:16 AM, Josh Elser <els...@apache.org> wrote:
>
> Right, this discussion isn't meant to imply that any of this exists -- instead, I wanted to make sure we're focused on building tooling which both devs and users will find usable and effective.
>
> What's your gut reaction to what I suggested? I think you're saying you see operators having to apply more understanding/insight to fix a "complex problem" as taking on more risk which you'd have to weigh. In other words, anything less than the verbatim "fix these problems" flags you mentioned earlier would require you to do the risk-analysis math if moving to HBase 2?
>
> Thanks for your insights.
>
>> On 5/29/19 4:45 PM, Andrew Purtell wrote:
>> I have yet to see essential HBCK functions in 1 replaced by anything - documentation, script, hbck2, whatever.
>>
>> Do we have a tool or script in HBase 2 that can rebuild meta from HDFS state? This would be faster than a complete restore from backup. It would be useful and important to offer this option to operators, but not essential, because it could be valid to say if meta is screwed so are you and you have to restore completely from backup. Meta is small, a fraction of total data footprint. Seems a real shame to impose such a high cost when there could be an alternative. I'd have to think for a while about accepting this kind of operational risk when HBase 1 has such tooling.
>>
>> What I am more worried about is this: Do we have a tool or script in HBase 2 that can fix errors in the region chain caused by failed splits, failed merges, or double assignment?
>> It seems not, and the implications for service availability are not good when compared with HBase 1. With HBase 1, hbck is an option. Sure, it has a lot of problematic aspects, but I have seen it recover a cluster's total availability with fairly fast execution.
>>
>> It could be valid, not saying I agree with this point, to clearly document that all aspects of recovery from corrupted metadata are the responsibility of the operator; at least this is full disclosure. We can then weigh the cost and risk associated with this policy when deciding if ever to upgrade.
>>
>>> On Wed, May 29, 2019 at 1:13 PM Josh Elser <els...@apache.org> wrote:
>>>
>>> My understanding was that recreating sweeping "fix it" flags was an anti-goal of HBCK2, but I'm surprised a grey-beard hasn't come in to confirm/dispute that :). I could be taking that out of context, or my dog remembers things better than I do.
>>>
>>> The reasoning behind this line of thinking for HBCK2 is:
>>>
>>> * Smaller actions are easier to implement correctly and be well-tested.
>>> * The more complex the action, the more likely it is for something we (as devs) didn't expect to happen, which results in a bug.
>>>
>>> The "stretch" in my mind is that we can string together small actions to recreate the bigger ones (the fix* type commands from hbck1), *but* teach operators to apply knowledge about their cluster instead of treating hbck like a black box.
>>>
>>> For example, we can try to decompose something like fixAssignments into something like: `for region in $(list non-open regions); do assign $region; done`. As developers, we don't have to catch every edge case of _something_ that might be specific to the admin's actual situation (e.g. what if a table is disabled and we don't want to assign those regions), and it lets us write better test cases.
>>>
>>> Again, this is what I have floating around in my head -- nothing more than that at present.
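The loop Josh sketches can be made concrete. A minimal sketch, assuming hbck2's `assigns` command invoked via `hbase hbck -j hbase-hbck2.jar` (per the hbase-operator-tools docs) and a hypothetical report step that produced a file of non-open regions; the encoded region names are made up, and the commands are echoed rather than executed:

```shell
# Stringing together primitive actions, per the decomposition above.
# /tmp/non_open_regions.txt stands in for a hypothetical "list non-open
# regions" report; the encoded region names are made-up samples.
cat > /tmp/non_open_regions.txt <<'EOF'
1588230740
a3f51b2c9d847e
EOF

while read -r region; do
  # Operator judgment goes here (e.g. skip regions of disabled tables),
  # replacing hbck1's built-in heuristics.
  echo hbase hbck -j hbase-hbck2.jar assigns "$region"   # drop "echo" to execute
done < /tmp/non_open_regions.txt
```

The point of contention in the thread is precisely that the judgment in the loop body, which hbck1 encoded once as tested logic, is here left to each operator.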
>>>> On 5/29/19 11:54 AM, Andrew Purtell wrote:
>>>>
>>>> To me this is a succinct specification of minimum functionality for a recovery tool: using on-disk bits, rebuild the meta table, with the end result a working cluster that did not miss any data during the reconstruction.
>>>>
>>>> Of course focusing on root causes of metadata mismanagement is appropriate when investigating a specific incident, but this is orthogonal to the question of whether or not recovery is possible after a bug corrupts metadata. It is customary for filesystems and databases to ship with a tool that attempts recovery after corruption, on the (correct, IMHO) assumption that corruption is inevitable, either due to logic bugs, hardware problems, or operator error.
>>>>
>>>> The features of hbck in HBase 1 that have resolved availability problems where I work are: fixMeta, fixAssignments, fixHdfsHoles, fixHdfsOverlaps. In HBaseFsck.java in branch-2 these are all in the unsupported options set. Because these are all lacking in HBase 2 I will not certify it ready for production to my employer. If there is some other tool which offers these recovery options, I'm not aware of it nor of documentation for it, and would appreciate a pointer if you have one.
>>>>
>>>> On Wed, May 29, 2019 at 7:11 AM Toshihiro Suzuki <brfrn...@apache.org> wrote:
>>>>
>>>>> Thanks Wellington.
>>>>>
>>>>>> I guess those can still be fixed with some combinations of commands today, such as merge/assign.
>>>>>
>>>>> Let me explain the situation I faced in the customer's cluster a little bit more. It seemed like the table data in HDFS was intact but they lost some metadata (in hbase:meta) for the table. So I needed to rebuild the meta from HDFS data. In this case, can we still fix it with some combination of commands today?
>>>>> If so, I would appreciate it if you could suggest the steps to me.
>>>>>
>>>>>> And focus on fixing the main root cause of such problems, as a means to soften the need to use such commands.
>>>>>
>>>>> Yes, correct. Actually I usually do that. But I didn't do that in this case.
>>>>>
>>>>> On Wed, May 29, 2019 at 5:47 AM Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:
>>>>>
>>>>>> Thanks Toshihiro! I guess those can still be fixed with some combinations of commands today, such as merge/assign. Of course, it requires some extra scripting and log reading in cases where many regions are in an inconsistent state; maybe we should work on providing a one-liner command that relies on the currently existing ones. And focus on fixing the main root cause of such problems, as a means to soften the need to use such commands.
>>>>>>
>>>>>> I'm not really a fan of offlinemetarepair, nor of hbck1's fix holes/overlaps; I would rather not have those back. Sure, those are easy and convenient to trigger, but hbck1 reports are sometimes misleading (for instance, it reports holes when region(s) on the chain is/are simply not online), and that, combined with the availability of such heavy hammers, has led inexperienced operators to fall into running it and getting into a worse state.
>>>>>>
>>>>>> On Wed, May 29, 2019 at 13:22, Toshihiro Suzuki <brfrn...@apache.org> wrote:
>>>>>>
>>>>>>> Hi Wellington,
>>>>>>>
>>>>>>> I saw table holes in a customer's cluster, and I just fixed the issues by the workaround I mentioned in HBASE-21665 <https://issues.apache.org/jira/browse/HBASE-21665>. I didn't dig into the reason why the table holes happened at that time because the customer didn't want it.
>>>>>>> However, IMO, whatever the reason, I think we should have a direct way to fix holes and overlaps.
>>>>>>>
>>>>>>> On Wed, May 29, 2019 at 4:57 AM Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So JMS, Toshihiro, it seems like upgrading from some 1.x to 2.x consistently triggers this problem? Do you guys know if there are any bug jiras open that would cover these scenarios? If not, and if you guys have enough resources for investigating it, maybe it's worth opening a specific jira?
>>>>>>>>
>>>>>>>> On Wed, May 29, 2019 at 11:40, Jean-Marc Spaggiari <jean-m...@spaggiari.org> wrote:
>>>>>>>>
>>>>>>>>> Personally, when I tried to upgrade from 1.4.x to 2.2.x I ended up in a situation where my meta was empty and had to get it repaired, but I lacked OfflineMetaRepair for 2.2.x, so I just had to delete all my tables, get a brand new installation, recreate the tables, and bulkload the data back into them. I would have been happy to have an OfflineMetaRepair.
>>>>>>>>>
>>>>>>>>> But it's more like an experimental cluster than a production one...
>>>>>>>>>
>>>>>>>>> JMS
>>>>>>>>>
>>>>>>>>> On Wed, May 29, 2019 at 06:36, Wellington Chevreuil <wellington.chevre...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Interesting, I haven't seen any cases where OfflineMetaRepair was really required among our customer base (running cdh6.1.x/hbase2.1.1, cdh6.2/hbase2.1.2). The majority of RIT issues I came across on hbase 2.x were related to AP/SCP failures, most of which could be sorted with the hbck2 commands available by then (in some cases, it required some CLI scripting to build up a "bulk" assign command).
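The "bulk" assign scripting Wellington describes can be as simple as collapsing a region list into one command line. A sketch, assuming hbck2's `assigns` command accepts multiple encoded region names (as the hbase-operator-tools docs describe; verify against your hbck2 version). Region names are made-up samples, and the command is echoed rather than executed:

```shell
# Build one bulk "assigns" invocation instead of a call per region.
# /tmp/non_open_regions.txt stands in for the output of log reading or a
# report step; the encoded region names are made-up samples.
printf '%s\n' 1588230740 a3f51b2c9d847e > /tmp/non_open_regions.txt

# Collapse the list onto a single command line.
regions=$(tr '\n' ' ' < /tmp/non_open_regions.txt)
echo hbase hbck -j hbase-hbck2.jar assigns $regions   # drop "echo" to execute
```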
>>>>>>>>>> On Wed, May 29, 2019 at 00:55, Toshihiro Suzuki <brfrn...@apache.org> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi Josh,
>>>>>>>>>>>
>>>>>>>>>>> Thank you for the explanation. I agree with the direction for HBCK2.
>>>>>>>>>>>
>>>>>>>>>>> The problem I wanted to tell you about in the Jira is that until we implement the features you mentioned, we don't have any direct way to fix holes and overlaps. The holes and overlaps can be created by bugs or operation errors, so I think we should be able to fix these issues.
>>>>>>>>>>>
>>>>>>>>>>> I thought OfflineMetaRepair could be a workaround for the issues until we implement the features of HBCK2.
>>>>>>>>>>>
>>>>>>>>>>> Regards,
>>>>>>>>>>> Toshi
>>>>>>>>>>>
>>>>>>>>>>> On Tue, May 28, 2019 at 9:12 AM Josh Elser <els...@apache.org> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Context: https://issues.apache.org/jira/browse/HBASE-21665
>>>>>>>>>>>>
>>>>>>>>>>>> I left a comment on the above issue about what I thought good things to build into HBCK2 would be -- a focus on specific "primitive" operations that an admin/operator could use to help repair an otherwise broken HBase installation. Some examples I had in my head were:
>>>>>>>>>>>>
>>>>>>>>>>>> * Create an empty region (to plug a hole)
>>>>>>>>>>>> * Report holes in a region chain
>>>>>>>>>>>>
>>>>>>>>>>>> In my head, the difference for HBCK2 was that we want to give folks the tools to fix their cluster, but we did not want to own the "just fix everything" kind of tool that HBCK1 had become.
>>>>>>>>>>>> That problem with HBCK1 was that it was often difficult/problematic for us to know how to correctly fix a problem (the same problem could be corrected in different ways).
>>>>>>>>>>>>
>>>>>>>>>>>> Andrew had some confusion about this, so I'm not sure if I'm off-base or if we're all in agreement on direction and we just need to do a better job documenting things. Thanks for keeping me honest either way :)
>>>>>>>>>>>>
>>>>>>>>>>>> And just in case it doesn't go without saying, HBCK2 would be something that helps fix a system, while we want to always understand the root cause of how/why we got into a situation where we needed HBCK2 and also address that.
>>>>>>>>>>>>
>>>>>>>>>>>> - Josh