Hi Sean,

Thanks for spending some time on that.

This option was one of my plans ;) But I might end up just wiping everything and trying an export snapshot from my other HBase 1.2.0 cluster to see what it does...

JMS

On Fri, Mar 29, 2019 at 19:32, Sean Busbey <[email protected]> wrote:

> Putting aside for now speculation on how you got to this state, I think
> with current tooling your best option for recovery is to sideline the
> /hbase directory, start with a fresh install, create your namespaces &
> tables, and bulkload the sidelined hfiles.
>
> JIRAs that aim to improve this situation (I'm sure feedback or help is
> welcome):
>
> * HBASE-21665 "OfflineMetaRepair tool fails with NPE"
> * HBASE-18840 "Add functionality to refresh meta table at master
>   startup" (as an alternative to making OfflineMetaRepairTool work
>   again; busted according to HBASE-21665)
> * HBASE-21966 "Fix region holes, overlaps, and other region related errors"
>
> On Fri, Mar 29, 2019 at 1:08 PM Jean-Marc Spaggiari
> <[email protected]> wrote:
> >
> > Hi Sean,
> >
> > Here is the hdfs content: https://pastebin.com/EqK1zhEe
> >
> > I unfortunately don't have HDFS audit logs :( And I cleaned HBase logs
> > before the last upgrade test, so RCA will be difficult :-/
> >
> > JMS
> >
> > On Fri, Mar 29, 2019 at 16:01, Sean Busbey <[email protected]> wrote:
> >
> > > So all we have in hbase:meta is an entry for each table that claims
> > > they're all in enabled state.
> > >
> > > And the info column family is totally empty? I believe this is a
> > > failure state we don't have tooling for yet. Can you upload and link
> > > the results of running hdfs dfs -ls -R on the /hbase directory?
> > >
> > > Do you happen to have HDFS auditing turned on and logs that go back a
> > > few weeks? I'd be curious about how we got into this state. The only
> > > way I've seen it happen thus far is when folks disabled the safety
> > > that keeps hbck1 from running.
> > >
> > > On Fri, Mar 29, 2019 at 9:40 AM Jean-Marc Spaggiari
> > > <[email protected]> wrote:
> > > >
> > > > Hi Sean,
> > > >
> > > > Thanks again for keeping an eye on that.
> > > >
> > > > I think the META content has been lost somewhere in the process.
> > > >
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:42 /hbase/data/hbase/meta/.tabledesc
> > > > -rw-r--r-- 3 hbase supergroup 1447 2019-03-12 15:42 /hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:42 /hbase/data/hbase/meta/.tmp
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:49 /hbase/data/hbase/meta/1588230740
> > > > -rw-r--r-- 3 hbase supergroup 32 2019-03-12 15:40 /hbase/data/hbase/meta/1588230740/.regioninfo
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:40 /hbase/data/hbase/meta/1588230740/info
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:40 /hbase/data/hbase/meta/1588230740/recovered.edits
> > > > -rw-r--r-- 3 hbase supergroup 0 2019-03-12 15:40 /hbase/data/hbase/meta/1588230740/recovered.edits/2.seqid
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:42 /hbase/data/hbase/meta/1588230740/rep_barrier
> > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 15:47 /hbase/data/hbase/meta/1588230740/table
> > > > -rw-r--r-- 3 hbase supergroup 5454 2019-03-12 15:47 /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc
> > > >
> > > > And this is the content of the file:
> > > > hbase@node2:~$ hbase hfile -p -f /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc
> > > > 2019-03-29 12:38:36,028 INFO [main] metrics.MetricRegistries: Loaded MetricRegistries class org.apache.hadoop.hbase.metrics.impl.MetricRegistriesImpl
> > > > K: customers/table:state/1552419727646/Put/vlen=2/seqid=903258414 V: \x08\x00
> > > > K: dns/table:state/1552419727462/Put/vlen=2/seqid=903258404 V: \x08\x00
> > > > K: email/table:state/1552419727691/Put/vlen=2/seqid=903258416 V: \x08\x00
> > > > K: email_proposed/table:state/1552419727602/Put/vlen=2/seqid=903258410 V: \x08\x00
> > > > K: ew_table/table:state/1552419727527/Put/vlen=2/seqid=903258406 V: \x08\x00
> > > > K: hbase:acl/table:state/1552419727547/Put/vlen=2/seqid=903258407 V: \x08\x00
> > > > K: hbase:namespace/table:state/1552419727382/Put/vlen=2/seqid=903258402 V: \x08\x00
> > > > K: page/table:state/1552419727669/Put/vlen=2/seqid=903258415 V: \x08\x00
> > > > K: pageAvro/table:state/1552419727572/Put/vlen=2/seqid=903258408 V: \x08\x00
> > > > K: pageMini/table:state/1552419727591/Put/vlen=2/seqid=903258409 V: \x08\x00
> > > > K: pageSpark/table:state/1552419727867/Put/vlen=2/seqid=903258417 V: \x08\x00
> > > > K: page_crc/table:state/1552419727635/Put/vlen=2/seqid=903258413 V: \x08\x00
> > > > K: page_duplicate/table:state/1552419727613/Put/vlen=2/seqid=903258411 V: \x08\x00
> > > > K: page_proposed/table:state/1552419727175/Put/vlen=2/seqid=903258401 V: \x08\x00
> > > > K: tree/table:state/1552419727502/Put/vlen=2/seqid=903258405 V: \x08\x00
> > > > K: work_proposed/table:state/1552419727402/Put/vlen=2/seqid=903258403 V: \x08\x00
> > > > K: work_sent/table:state/1552419727624/Put/vlen=2/seqid=903258412 V: \x08\x00
> > > > Scanned kv count -> 17
> > > >
> > > > Seems that it's still aware of the tables. But I don't see any reference
> > > > to any server...
> > > >
> > > > JMS
> > > >
> > > > On Fri, Mar 29, 2019 at 12:25, Sean Busbey <[email protected]> wrote:
> > > >
> > > > > Okay I read the logs again and we're in a weird failure state.
> > > > >
> > > > > 1) Master comes up
> > > > > 2) Master schedules SCP for all RS
> > > > > 3) Master recovers meta
> > > > > 4) SCP for every server claims AM currently thinks 0 regions were
> > > > > assigned to each server.
> > > > > 5) Master successfully finishes WAL splitting from dead RS and works
> > > > > through prior split attempts that died?
> > > > > 6) WAL recovery from every RS says there are no edits for any region
> > > > > 7) No Assignments are scheduled out of the SCP because each believes
> > > > > there were no regions hosted on the server that's being processed
> > > > > 8) Master reports all SCP have completed successfully
> > > > > 9) Master times out at initializing
> > > > >
> > > > > Could you link to a scan of meta? It'll include server names, table
> > > > > names, and region information, so I'm not sure if any of those are too
> > > > > sensitive?
> > > > >
> > > > > On Thu, Mar 14, 2019 at 11:36 AM Jean-Marc Spaggiari
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > Updated logs are there: https://pastebin.com/1UrTA8JS
> > > > > >
> > > > > > They really look exactly the same as the previous version :-/
> > > > > >
> > > > > > There is no warning, no error, nothing :(
> > > > > >
> > > > > > JMS
> > > > > >
> > > > > > On Thu, Mar 14, 2019 at 13:38, Sean Busbey <[email protected]> wrote:
> > > > > >
> > > > > > > We still need to find out why hbase:namespace is not online. Did the
> > > > > > > logs complaining about being unable to assign regions not include
> > > > > > > anything about the region(s) for the namespace table?
> > > > > > >
> > > > > > > Can you upload updated logs?
> > > > > > >
> > > > > > > If there's no mention of it then that sounds like we need an hbck2
> > > > > > > command to output the current assignment state of a region.
> > > > > > > > > > > > > > On Thu, Mar 14, 2019 at 11:57 AM Jean-Marc Spaggiari > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > I stopped all the region servers, started the master. It was > > > > > complaining > > > > > > > > about not being able to assign regions. Then started region > > > servers, > > > > > but > > > > > > > > after 5 minutes got the same error :-/ > > > > > > > > > > > > > > > > 2019-03-14 12:46:38,586 ERROR > > > [master/node2:60000:becomeActiveMaster] > > > > > > > > master.HMaster: Failed to become active master > > > > > > > > java.lang.IllegalStateException: Expected the service > > > > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the > service > > > has > > > > > > > FAILED > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347) > > > > > > > > at > > > > > > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595) > > > > > > > > at java.lang.Thread.run(Thread.java:748) > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms waiting for > > > > > namespace > > > > > > > > table to be assigned and enabled: tableName=hbase:namespace, > > > > > > > state=ENABLED > > > > > > > > at > > > > > > > > > > > > > > > > > > 
> > > > > > org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339) > > > > > > > > ... 4 more > > > > > > > > > > > > > > > > Then stopped all, configured the maintenance mode, started > all, > > > get > > > > > the > > > > > > > > same error. I tried to bounce the RS within those 5 minutes > > > without > > > > > any > > > > > > > > difference. I still get the same exception after 5 minutes: > > > > > > > > 2019-03-14 12:55:35,167 ERROR > > > [master/node2:60000:becomeActiveMaster] > > > > > > > > master.HMaster: ***** ABORTING master node2.distparser.com > > > > > > > ,60000,1552582220013: > > > > > > > > Unhandled exception. Starting shutdown. 
***** > > > > > > > > java.lang.IllegalStateException: Expected the service > > > > > > > > ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the > service > > > has > > > > > > > FAILED > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347) > > > > > > > > at > > > > > > org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595) > > > > > > > > at java.lang.Thread.run(Thread.java:748) > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms waiting for > > > > > namespace > > > > > > > > table to be assigned and enabled: tableName=hbase:namespace, > > > > > > > state=ENABLED > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > 
org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339) > > > > > > > > ... 4 more > > > > > > > > > > > > > > > > I validated the maintenance mode: > > > > > > > > 2019-03-14 12:50:25,421 INFO [main] master.HMaster: Detected > > > > > > > > hbase.master.maintenance_mode=true via configuration. > > > > > > > > > > > > > > > > And now removed it. > > > > > > > > > > > > > > > > What can I try next? > > > > > > > > > > > > > > > > I know I can always rename /hbase to /hbase_old and bulkload > the > > > > > HFiles > > > > > > > > back to the table, that's not a big deal, but I'm curious to > see > > > if > > > > > we > > > > > > > can > > > > > > > > get that working... > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > BTW, when we try to access the Master webUI while it starts > we > > > are > > > > > > > getting > > > > > > > > an exception: > > > > > > > > 2019-03-14 12:50:40,782 WARN [qtp196717412-78] > > > > > servlet.ServletHandler: > > > > > > > > /master-status > > > > > > > > java.lang.NullPointerException > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.tmpl.master.MasterStatusTmplImpl.renderNoFlush(MasterStatusTmplImpl.java:326) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.renderNoFlush(MasterStatusTmpl.java:397) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.tmpl.master.MasterStatusTmpl.render(MasterStatusTmpl.java:388) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MasterStatusServlet.doGet(MasterStatusServlet.java:79) > > > > > > > > at > javax.servlet.http.HttpServlet.service(HttpServlet.java:687) > > > > > > > > at > javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > > > > > > > > at > > > > > > org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:848) > > > > > > > > at > > > > 
> > > > > > > > > > > > > > > > > > > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1780) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.http.lib.StaticUserWebFilter$StaticUserFilter.doFilter(StaticUserWebFilter.java:112) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.eclipse.jetty.servlet.ServletHandler$CachedChain.doFilter(ServletHandler.java:1767) > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.http.ClickjackingPreventionFilter.doFilter(ClickjackingPreventionFilter.java:48) > > > > > > > > > > > > > > > > > > > > > > > > Le jeu. 14 mars 2019 à 08:07, Wellington Chevreuil < > > > > > > > > [email protected]> a écrit : > > > > > > > > > > > > > > > > > Yeah, as I suspected in my previous comment, for this type > of > > > > > > > timeouts, the > > > > > > > > > maintenance mode wouldn't give any help. It's weird that AM > > > starts > > > > > but > > > > > > > > > apparently does nothing until the namespace 5 mins timeout > is > > > > > reached: > > > > > > > > > ... 
> > > > > > > > > 2019-03-12 20:53:45,942 INFO [master/node2:60000:becomeActiveMaster] assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > > > 2019-03-12 20:54:45,725 INFO [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0] zookeeper.ZooKeeper: Session: 0x16911bd542a02a2 closed
> > > > > > > > > 2019-03-12 20:54:45,725 INFO [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x16911bd542a02a2
> > > > > > > > > 2019-03-12 20:58:46,603 ERROR [master/node2:60000:becomeActiveMaster] master.HMaster: Failed to become active master
> > > > > > > > > java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
> > > > > > > > > ...
> > > > > > > > >
> > > > > > > > > I would expect the namespace region to be processed by this call
> > > > > > > > > <https://github.com/apache/hbase/blob/2.2.0-RC0/hbase-server/src/main/java/org/apache/hadoop/hbase/master/HMaster.java#L1088>
> > > > > > > > > if the namespace region is offline, as the timeout suggests. Also odd
> > > > > > > > > is that we don't see any logs suggesting offlined regions are getting
> > > > > > > > > assigned. Maybe all regions are already online on RSes? But then master
> > > > > > > > > should have figured that out. Have you already tried restarting all
> > > > > > > > > RSes? That could kick off some reassignments.
> > > > > > > > >
> > > > > > > > > On Thu, Mar 14, 2019 at 02:21, Jean-Marc Spaggiari <
> > > > > > > > > [email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi Wellington,
> > > > > > > > > >
> > > > > > > > > > Indeed, the META is now deployed. I found the namespace region encoded
> > > > > > > > > > name using hdfs dfs -ls -R /hbase/data/hbase/namespace and it gives
> > > > > > > > > > me 7f4a480f47f98300185d1ae2ff663295. But here again, HBCK doesn't want
> > > > > > > > > > to do anything because the master is initializing :( I tried with and
> > > > > > > > > > without the maintenance flag and I get the same result.
> > > > > > > > > >
> > > > > > > > > > On HBCK2 side: PleaseHoldException: Master is initializing
> > > > > > > > > > On the master side, it just stopped after 5 minutes trying to assign
> > > > > > > > > > namespace :(
> > > > > > > > > >
> > > > > > > > > > JMS
> > > > > > > > > >
> > > > > > > > > > On Wed, Mar 13, 2019 at 12:04, Wellington Chevreuil <
> > > > > > > > > > [email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > "1588230740" would be the meta region name, not namespace. It seems
> > > > > > > > > > > meta is already online, per the log below:
> > > > > > > > > > > ...
> > > > > > > > > > > 2019-03-12 20:53:41,037 INFO [master/node2:60000:becomeActiveMaster] master.HMaster: hbase:meta {1588230740 state=OPEN, ts=1552438420570, server=node7.distparser.com,16020,1552421510124}
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > The maintenance mode I suggested before was to have master doing the
> > > > > > > > > > > minimum required stuff while attempting to get meta/namespace
> > > > > > > > > > > online, but I guess it wouldn't be able to avoid such timeouts. The
> > > > > > > > > > > message below also means AM could read the meta table, giving another
> > > > > > > > > > > indication meta is fine:
> > > > > > > > > > > ...
> > > > > > > > > > > 2019-03-12 20:53:45,942 INFO [master/node2:60000:becomeActiveMaster] assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > Now the issue is the namespace table. For some reason, AM is not able
> > > > > > > > > > > to kick off APs before the 5 minute timeout expires, and that's
> > > > > > > > > > > probably why the namespace table never becomes available:
> > > > > > > > > > > ...
> > > > > > > > > > > 2019-03-12 20:53:45,942 INFO [master/node2:60000:becomeActiveMaster] assignment.AssignmentManager: Joined the cluster in 308msec
> > > > > > > > > > > 2019-03-12 20:54:45,725 INFO [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0] zookeeper.ZooKeeper: Session: 0x16911bd542a02a2 closed
> > > > > > > > > > > 2019-03-12 20:54:45,725 INFO [ReadOnlyZKClient-latitude.distparser.com:2181@0x7ea9b2c0-EventThread] zookeeper.ClientCnxn: EventThread shut down for session: 0x16911bd542a02a2
> > > > > > > > > > > 2019-03-12 20:58:46,603 ERROR [master/node2:60000:becomeActiveMaster] master.HMaster: Failed to become active master
> > > > > > > > > > > java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
> > > > > > > > > > > ...
> > > > > > > > > > >
> > > > > > > > > > > You may be able to force the namespace region to come online with the
> > > > > > > > > > > hbck2 assigns command. You would need to find out the namespace region
> > > > > > > > > > > name first; you can either scan the meta table or check the region dir
> > > > > > > > > > > name in hdfs with "hdfs dfs -ls -R /hbase | grep namespace", in order
> > > > > > > > > > > to pass it as a param for
> > > > > > > > > > >
> > > > > > > > > > > On Wed, Mar 13, 2019 at 13:00, Jean-Marc Spaggiari <
> > > > > > > > > > > [email protected]> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Hi Sean,
> > > > > > > > > > > >
> > > > > > > > > > > > I tried.
I looked-up the region name for > base:namespace > > > like > > > > > > > this: > > > > > > > > > > > > > > > > > > > > > > > > hdfs dfs -ls /hbase/data/hbase/meta/ > > > > > > > > > > > > > > > > > > > > > > > > And found the region to be 1588230740. > > > > > > > > > > > > > > > > > > > > > > > > The master dies after 5 minutes, so I start the > master, > > > wait > > > > > 2 > > > > > > > > > minutes > > > > > > > > > > to > > > > > > > > > > > > be sure it's up, and run the following command: > > > > > > > > > > > > > > > > > > > > > > > > bin/hbase hbck -j > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar > > > > > > > > > > > > assigns 1588230740 > > > > > > > > > > > > > > > > > > > > > > > > But HBCK2 doesn't like it: > > > > > > > > > > > > 08:57:35.273 [main] INFO > > > > > > > > > > > > org.apache.hadoop.hbase.client.RpcRetryingCallerImpl > - > > > Call > > > > > > > > > exception, > > > > > > > > > > > > tries=9, retries=16, started=29322 ms ago, > > > cancelled=false, > > > > > > > > > > > > msg=org.apache.hadoop.hbase.PleaseHoldException: > Master > > > is > > > > > > > > > initializing > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.HMaster.checkInitialized(HMaster.java:3057) > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.MasterRpcServices.getClusterStatus(MasterRpcServices.java:942) > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > 
org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) > > > > > > > > > > > > at > > > > > org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:413) > > > > > > > > > > > > at > > > > > > > org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:130) > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:324) > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:304) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > It keeps retrying and after 16 times it stopped > saying > > > the > > > > > > > master is > > > > > > > > > > not > > > > > > > > > > > > initialized. > > > > > > > > > > > > > > > > > > > > > > > > On the WebUI I can see that there is a single region > > > > > assigned, > > > > > > > the > > > > > > > > > META > > > > > > > > > > > > region. > > > > > > > > > > > > > > > > > > > > > > > > Also, here is the HDFS structure of my META table. > Sounds > > > > > like > > > > > > > some > > > > > > > > > > parts > > > > > > > > > > > > got lost in the process (The info content). 
> > > > > > > > > > > > > > > > > > > > > > > > hbase@node2:~/hbase-2.2.0$ hdfs dfs -ls -R > > > > > > > /hbase/data/hbase/meta/ > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:42 > > > > > > > > > > > > /hbase/data/hbase/meta/.tabledesc > > > > > > > > > > > > -rw-r--r-- 3 hbase supergroup 1447 2019-03-12 > > > 15:42 > > > > > > > > > > > > > /hbase/data/hbase/meta/.tabledesc/.tableinfo.0000000001 > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:42 > > > > > > > > > > > > /hbase/data/hbase/meta/.tmp > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:49 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740 > > > > > > > > > > > > -rw-r--r-- 3 hbase supergroup 32 2019-03-12 > > > 15:40 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/.regioninfo > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:40 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/info > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:40 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/recovered.edits > > > > > > > > > > > > -rw-r--r-- 3 hbase supergroup 0 2019-03-12 > > > 15:40 > > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/recovered.edits/2.seqid > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:42 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/rep_barrier > > > > > > > > > > > > drwxr-xr-x - hbase supergroup 0 2019-03-12 > > > 15:47 > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/table > > > > > > > > > > > > -rw-r--r-- 3 hbase supergroup 5454 2019-03-12 > > > 15:47 > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > /hbase/data/hbase/meta/1588230740/table/b65e8774ff284e77bf22641de36110cc > > > > > > > > > > > > > > > > > > > > > > > > What will be the next best step? 
> > > > > > > > > > > > > > > > > > > > > > > > Thanks, > > > > > > > > > > > > > > > > > > > > > > > > JMS > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Le mer. 13 mars 2019 à 08:45, Sean Busbey < > > > [email protected]> > > > > > a > > > > > > > > > écrit > > > > > > > > > > : > > > > > > > > > > > > > > > > > > > > > > > > > Okay so master thinks hbase:namespace is already > > > enabled, > > > > > but > > > > > > > no RS > > > > > > > > > > > > > believes it should be hosting the regions. > > > > > > > > > > > > > > > > > > > > > > > > > > Can you find the region name for the > hbase:namespace > > > > > region and > > > > > > > > > issue > > > > > > > > > > > > > an hbck2 assigns command for it? > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 12, 2019 at 8:26 PM Jean-Marc Spaggiari > > > > > > > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > It doesn't say that much :( > > > > > > > > > > > > > > > > > > > > > > > > > > > > hbase@node2:~/hbase-2.2.0$ cat > > > > > > > logs/hbase-hbase-master-node2.log > > > > > > > > > > | > > > > > > > > > > > > > grep -i > > > > > > > > > > > > > > namespace > > > > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms > > > > > waiting for > > > > > > > > > > > namespace > > > > > > > > > > > > > > table to be assigned and enabled: > > > > > tableName=hbase:namespace, > > > > > > > > > > > > > state=ENABLED > > > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108) > > > > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms > > > > > waiting for > > > > > > > > > > > namespace > > > > > > > > > > > > > > table to be assigned and enabled: > > > > > 
tableName=hbase:namespace, > > > > > > > > > > > > > state=ENABLED > > > > > > > > > > > > > > at > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108) > > > > > > > > > > > > > > > > > > > > > > > > > > > > I cleared the logs before restarting the > instance. > > > That > > > > > all > > > > > > > what > > > > > > > > > it > > > > > > > > > > > > says > > > > > > > > > > > > > > about namespace. > > > > > > > > > > > > > > > > > > > > > > > > > > > > Full logs are available there: > > > > > https://pastebin.com/9j2Rzdcg > > > > > > > > > > > > > > > > > > > > > > > > > > > > Le mar. 12 mars 2019 à 20:47, Sean Busbey < > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > a > > > > > > > > > > > > > écrit : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > okay so the master spent ~5 minutes waiting to > see > > > if > > > > > it > > > > > > > could > > > > > > > > > > get > > > > > > > > > > > > the > > > > > > > > > > > > > > > namespace table working. when it couldn't it > > > aborted. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > can you look back over that 5 minutes and see > what > > > the > > > > > > > master > > > > > > > > > had > > > > > > > > > > > to > > > > > > > > > > > > > > > say about the namespace table? did the master > think > > > > > some > > > > > > > > > > particular > > > > > > > > > > > > > > > server should have it open already? was it > waiting > > > for > > > > > > > someone > > > > > > > > > to > > > > > > > > > > > > > > > finish opening or closing it? > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, Mar 12, 2019 at 6:39 PM Jean-Marc > Spaggiari > > > > > > > > > > > > > > > <[email protected]> wrote: > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Le mar. 
12 mars 2019 à 19:25, Sean Busbey < > > > > > > > [email protected] > > > > > > > > > > > > > > > > > > > > a > > > > > > > > > > > > > écrit : > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > your command above points at the wrong jar > > > from the > > > > > > > hbck2 > > > > > > > > > > repo. > > > > > > > > > > > > > it's > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > pointing at the one where you need to > manually > > > > > assemble > > > > > > > all > > > > > > > > > the > > > > > > > > > > > > > > > > > dependencies it has. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > You want the one that does not say > "original" > > > in > > > > > the > > > > > > > name. > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Ha!!! That's why! Way easier ;) > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > indeed, this works even without removing all > > > > > environment > > > > > > > > > > > variables: > > > > > > > > > > > > > > > > bin/hbase hbck -j > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > test/hbase-operator-tools/hbase-hbck2/target/hbase-hbck2-1.0.0-SNAPSHOT.jar > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > Can you confirm it's the one in > > > > > > > > > > > > > > > > > > > the bin tarball? what does the version > > > command > > > > > > > output? > > > > > > > > > > What > > > > > > > > > > > > > does > > > > > > > > > > > > > > > the > > > > > > > > > > > > > > > > > > > mapredcp command output? 
> > > > > > > > > > > > > > > > > > > What does the cli help for the hbase command show?
> > > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > > hbase@node2:~/hbase-2.2.0$ hbase mapredcp
> > > > > > > > > > > > > > > > > > /home/hbase/hbase-2.2.0/bin/../lib/shaded-clients/hbase-shaded-mapreduce-2.2.0.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/audience-annotations-0.5.0.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/commons-logging-1.2.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/findbugs-annotations-1.3.9-1.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/htrace-core4-4.2.0-incubating.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/log4j-1.2.17.jar:/home/hbase/hbase-2.2.0/bin/../lib/client-facing-thirdparty/slf4j-api-1.7.25.jar
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > that looks great now.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Once you correct the hbck2 jar above I think you'll be good for
> > > > > > > > > > > > > > > > > invoking HBCK2.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > Next, what does the initializing master say it's doing? It should
> > > > > > > > > > > > > > > > > be on the master UI near the bottom.
> > > > > > > > > > > > > > > > > If it hasn't made progress since your last update it'll be waiting
> > > > > > > > > > > > > > > > > for the hbase:namespace table. If it is, find the region and see what
> > > > > > > > > > > > > > > > > the last few messages in the master log are about that region.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > The master died some time ago. It dies after 5 minutes.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > 2019-03-12 19:35:58,568 ERROR [master/node2:60000:becomeActiveMaster] master.HMaster: Failed to become active master
> > > > > > > > > > > > > > > > java.lang.IllegalStateException: Expected the service ClusterSchemaServiceImpl [FAILED] to be RUNNING, but the service has FAILED
> > > > > > > > > > > > > > > >     at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.checkCurrentState(AbstractService.java:345)
> > > > > > > > > > > > > > > >     at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.awaitRunning(AbstractService.java:291)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1341)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1119)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2347)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:595)
> > > > > > > > > > > > > > > >     at java.lang.Thread.run(Thread.java:748)
> > > > > > > > > > > > > > > > Caused by: java.io.IOException: Timedout 300000ms waiting for namespace table to be assigned and enabled: tableName=hbase:namespace, state=ENABLED
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.TableNamespaceManager.start(TableNamespaceManager.java:108)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.ClusterSchemaServiceImpl.doStart(ClusterSchemaServiceImpl.java:63)
> > > > > > > > > > > > > > > >     at org.apache.hbase.thirdparty.com.google.common.util.concurrent.AbstractService.startAsync(AbstractService.java:226)
> > > > > > > > > > > > > > > >     at org.apache.hadoop.hbase.master.HMaster.initClusterSchemaService(HMaster.java:1339)
> > > > > > > > > > > > > > > >     ... 4 more
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > I just restarted it. I can see the meta table being assigned. I can
> > > > > > > > > > > > > > > > access the WebUI and I don't see any initializing information. In the
> > > > > > > > > > > > > > > > table section, I don't see anything, in any tab. However, when doing
> > > > > > > > > > > > > > > > "list" on the shell, I can see my tables. But I cannot scan them.
> > > > > > > > > > > > > > > > Scanning any table gives:
> > > > > > > > > > > > > > > > hbase(main):001:0> scan 'hbase:namespace'
> > > > > > > > > > > > > > > > ROW  COLUMN+CELL
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > ERROR: Unknown table hbase:namespace!
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > For usage try 'help "scan"'
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Took 1.0395 seconds
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > JMS
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > --
> > > > > > > > > > > > > > > Sean
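
The request in the thread to look back over the five minutes before the master aborted and see what it logged about the namespace table can be sketched as a small log filter. This is only an illustration, not something used in the thread: it assumes the master log uses the `yyyy-MM-dd HH:mm:ss,SSS` timestamp prefix seen in the quoted ERROR line, and the names `parse_ts` and `lines_before` are hypothetical helpers.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout, taken from the quoted line
# "2019-03-12 19:35:58,568 ERROR [...] master.HMaster: ..."
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23  # length of "2019-03-12 19:35:58,568"


def parse_ts(line):
    """Return the leading timestamp of a log line, or None if it has none."""
    try:
        return datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
    except ValueError:
        # Stack-trace continuation lines ("at org.apache...") land here.
        return None


def lines_before(log_lines, abort_ts, keyword="namespace", window_s=300):
    """Yield timestamped lines containing `keyword` that were logged in the
    `window_s` seconds leading up to (and including) `abort_ts`."""
    start = abort_ts - timedelta(seconds=window_s)
    for line in log_lines:
        ts = parse_ts(line)
        if ts is not None and start <= ts <= abort_ts and keyword in line:
            yield line.rstrip("\n")
```

To use it, pass the open master log file as `log_lines` and `parse_ts("2019-03-12 19:35:58,568")` as `abort_ts`. The 300-second default mirrors the `Timedout 300000ms` message in the stack trace.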
