[ https://issues.apache.org/jira/browse/HBASE-22940?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Xiaolin Ha updated HBASE-22940:
-------------------------------
    Attachment: detailed snapshot nonode errror logs.txt

> Snapshot NoNode error
> ---------------------
>
>                 Key: HBASE-22940
>                 URL: https://issues.apache.org/jira/browse/HBASE-22940
>             Project: HBase
>          Issue Type: Bug
>          Components: snapshots
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Minor
>         Attachments: detailed snapshot nonode errror logs.txt
>
>
> When we take snapshots of thousands of tables on our cluster, a
> NoNodeException occasionally occurs. The error stack is as follows:
> {quote}ERROR: org.apache.hadoop.hbase.snapshot.HBaseSnapshotException: Snapshot \{ ss=KYLIN_2JAU7T91XU_mtzjyprc table=kylin_zjyprc_bigdata_staging:KYLIN_2JAU7T91XU type=FLUSH } had an error. Procedure KYLIN_2JAU7T91XU_mtzjyprc \{ waiting=[] done=[] }
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:350)
>   at org.apache.hadoop.hbase.master.HMaster.isSnapshotDone(HMaster.java:3674)
>   at org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:44817)
>   at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2059)
>   at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:126)
>   at org.apache.hadoop.hbase.ipc.MasterFifoRpcScheduler.lambda$dispatch$1(MasterFifoRpcScheduler.java:68)
>   at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>   at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable via zjy-hadoop-prc-st1309.bj,24600,1557969473924:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/zjyprc-xiaomi/online-snapshot/reached/KYLIN_2JAU7T91XU_mtzjyprc/zjy-hadoop-prc-st1309.bj,24600,1557969473924
>   at org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>   at org.apache.hadoop.hbase.master.snapshot.TakeSnapshotHandler.rethrowExceptionIfFailed(TakeSnapshotHandler.java:312)
>   at org.apache.hadoop.hbase.master.snapshot.SnapshotManager.isSnapshotDone(SnapshotManager.java:340)
>   ... 10 more
> Caused by: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: KeeperErrorCode = NoNode for /hbase/zjyprc-xiaomi/online-snapshot/reached/KYLIN_2JAU7T91XU_mtzjyprc/zjy-hadoop-prc-st1309.bj,24600,1557969473924
>   at org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:270)
>   at org.apache.hadoop.hbase.procedure.ProcedureMember.controllerConnectionFailure(ProcedureMember.java:225)
>   at org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberCompleted(ZKProcedureMemberRpcs.java:267)
>   at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:185)
>   at org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>   ... 4 more
> @zjy-hadoop-prc-zk05.bj/10.152.48.41:24500
> Here is some help for this command:
> Take a snapshot of specified table. Examples:
>   hbase> snapshot 'sourceTable', 'snapshotName'
>   hbase> snapshot 'namespace:sourceTable', 'snapshotName', \{SKIP_FLUSH => true}
> {quote}
> I looked through the relevant server logs and found that the current
> snapshot implementation has a problem. When the Procedure for a snapshot is
> created, the region servers hosting the table's regions are set as the
> acquired and released barriers. The master watches zk, and once all the
> barrier region servers have added nodes under the parent reached node, the
> coordinator releases ALL the barriers and the snapshot procedure is
> considered complete. The relevant parent reached/required node is then
> cleared by `resetMembers()`. However, every region server adds a node under
> the parent reached/required node, so a non-barrier region server that adds
> its child after this cleanup encounters a NoNodeException.
> We think that having the coordinator set only the relevant region servers as
> barriers may not be enough. Since all region servers add nodes, perhaps all
> of them should be barriers.
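
The ordering problem described in the issue can be illustrated with a small in-memory sketch. This is a toy model under stated assumptions, not HBase or ZooKeeper code: the class `SnapshotBarrierRace`, the method `simulate`, the local `NoNodeException` stand-in, the shortened znode path, and the server names `rs1`..`rs3` are all hypothetical, and the real procedure runs concurrently rather than in this fixed order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Toy model of the reported race; hypothetical names, no real ZooKeeper involved. */
class SnapshotBarrierRace {

    /** Local stand-in for ZooKeeper's KeeperException.NoNodeException. */
    static class NoNodeException extends Exception {
        NoNodeException(String path) {
            super("NoNode for " + path);
        }
    }

    /** Adds a child under a parent "znode"; fails if the parent no longer exists. */
    static void createChild(Map<String, Set<String>> znodes, String parent, String child)
            throws NoNodeException {
        Set<String> children = znodes.get(parent);
        if (children == null) {
            throw new NoNodeException(parent + "/" + child);
        }
        children.add(child);
    }

    /**
     * Barrier members report first, the coordinator then clears the parent node,
     * and the remaining servers report afterwards.
     * Returns the servers that hit NoNodeException.
     */
    static List<String> simulate(List<String> barrierMembers, List<String> lateServers) {
        Map<String, Set<String>> znodes = new HashMap<>();
        String reached = "/online-snapshot/reached/snap1"; // shortened, hypothetical path
        znodes.put(reached, new HashSet<>());
        List<String> failed = new ArrayList<>();

        // 1. Every expected barrier member adds its child node.
        for (String rs : barrierMembers) {
            try {
                createChild(znodes, reached, rs);
            } catch (NoNodeException e) {
                failed.add(rs);
            }
        }

        // 2. The coordinator sees all expected members and cleans up the parent node
        //    (the role the issue attributes to resetMembers()).
        if (znodes.get(reached).containsAll(barrierMembers)) {
            znodes.remove(reached);
        }

        // 3. Servers outside the barrier set report after the cleanup.
        for (String rs : lateServers) {
            try {
                createChild(znodes, reached, rs);
            } catch (NoNodeException e) {
                failed.add(rs);
            }
        }
        return failed;
    }

    public static void main(String[] args) {
        // Only rs1 and rs2 are barriers; rs3 reports after the parent is cleared.
        System.out.println(simulate(Arrays.asList("rs1", "rs2"), Arrays.asList("rs3")));
        // All servers are barriers, as the issue suggests; nobody reports late.
        System.out.println(simulate(Arrays.asList("rs1", "rs2", "rs3"),
                Collections.<String>emptyList()));
    }
}
```

In this model, a server outside the barrier set fails only because the parent node is removed before it reports. Treating every server as a barrier member removes the late writer in the sketch, but whether that is the right fix for HBase itself is a question for the issue discussion.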