[ 
https://issues.apache.org/jira/browse/HBASE-13217?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14521110#comment-14521110
 ] 

Stephen Yuan Jiang commented on HBASE-13217:
--------------------------------------------

Actually, we have reproduced the same issue in our branch-1.1 testing on April 
20, 2015.

Here is the issue:
- The master issued the 'reached' request as part of the 2-phase commit. The 
problem is that it did not wait for all RSs to respond before it declared the 
procedure done and deleted the znodes.
- One RS then completed the 2nd phase and tried to update the 'reached' znode, 
but the znode was gone, so it threw an exception.
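
To make the race concrete, here is a minimal toy model of the 'reached' barrier. It is only a sketch and not the HBase code: the class ReachedBarrierSketch and its methods are hypothetical names, the znodes are modeled as in-memory sets, and the member names are shortened from the log. It shows the invariant the coordinator should hold: clear the nodes only once every member that acquired the barrier has also reported 'reached'.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Toy in-memory model of the 'reached' barrier (hypothetical, not HBase code). */
class ReachedBarrierSketch {
    private final Set<String> expected;                    // members that acquired the barrier
    private final Set<String> reached = new HashSet<>();   // members that reported 'reached'
    private boolean cleared = false;                       // true once the "znodes" are deleted

    ReachedBarrierSketch(Set<String> expectedMembers) {
        this.expected = new HashSet<>(expectedMembers);
    }

    /** A member reports that it finished phase 2; fails if the nodes are already gone. */
    synchronized void memberReached(String member) {
        if (cleared) {
            // Stands in for the NoNodeException seen by the late region server.
            throw new IllegalStateException("NoNode for reached/" + member);
        }
        reached.add(member);
    }

    /** Members that acquired the barrier but have not reported 'reached' yet. */
    synchronized Set<String> missing() {
        Set<String> result = new HashSet<>(expected);
        result.removeAll(reached);
        return result;
    }

    /** Coordinator side: clear the nodes only when nobody is missing. */
    synchronized boolean completeIfAllReached() {
        if (!missing().isEmpty()) {
            return false; // still waiting; do not delete anything
        }
        cleared = true;
        return true;
    }

    public static void main(String[] args) {
        Set<String> members = new HashSet<>(Arrays.asList(
            "myregionserver-1", "myregionserver-4", "myregionserver-5",
            "myregionserver-6", "myclustermaster"));
        ReachedBarrierSketch barrier = new ReachedBarrierSketch(members);
        for (String m : Arrays.asList("myregionserver-1", "myregionserver-4",
                                      "myregionserver-6", "myclustermaster")) {
            barrier.memberReached(m);
        }
        System.out.println("completed early? " + barrier.completeIfAllReached());
        System.out.println("still missing: " + barrier.missing());
        barrier.memberReached("myregionserver-5");
        System.out.println("completed now? " + barrier.completeIfAllReached());
    }
}
```

In this model the coordinator refuses to complete while myregionserver-5 is still missing, so the late member never sees a deleted node. The behavior described above corresponds to setting cleared without the missing() check.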

Here is what the znode tree looks like when the master thinks the procedure is 
completed (note that myregionserver-5 is missing under the 'reached' znode):
{noformat}
2015-04-20 11:50:19,004 DEBUG 
[(myclustermaster.novalocal,16000,1429530381377)-proc-coordinator-pool2-thread-1]
 procedure.ZKProcedureCoordinatorRpcs: Creating reached barrier zk 
node:/hbase-unsecure/flush-table-proc/reached/MyTargetTable
2015-04-20 11:50:19,010 DEBUG [main-EventThread] 
procedure.ZKProcedureCoordinatorRpcs: Node created: 
/hbase-unsecure/flush-table-proc/reached/MyTargetTable/myregionserver-4.novalocal,16020,1429530401185
2015-04-20 11:50:19,010 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
Current zk system:
2015-04-20 11:50:19,011 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-/hbase-unsecure/flush-table-proc
2015-04-20 11:50:19,011 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-acquired
2015-04-20 11:50:19,011 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|----MyTargetTable
2015-04-20 11:50:19,012 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-4.novalocal,16020,1429530401185
2015-04-20 11:50:19,012 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-1.novalocal,16020,1429530398903
2015-04-20 11:50:19,013 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-5.novalocal,16020,1429530399802
2015-04-20 11:50:19,013 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-6.novalocal,16020,1429530404517
2015-04-20 11:50:19,014 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myclustermaster.novalocal,16020,1429530402734
2015-04-20 11:50:19,014 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-abort
2015-04-20 11:50:19,015 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-reached
2015-04-20 11:50:19,015 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|----MyTargetTable
2015-04-20 11:50:19,016 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-4.novalocal,16020,1429530401185
2015-04-20 11:50:19,016 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-1.novalocal,16020,1429530398903
2015-04-20 11:50:19,017 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myregionserver-6.novalocal,16020,1429530404517
2015-04-20 11:50:19,017 DEBUG [main-EventThread] procedure.ZKProcedureUtil: 
|-------myclustermaster.novalocal,16020,1429530402734
2015-04-20 11:50:19,018 INFO  
[(myclustermaster.novalocal,16000,1429530381377)-proc-coordinator-pool2-thread-1]
 procedure.Procedure: Procedure 'MyTargetTable' execution completed
2015-04-20 11:50:19,018 INFO  
[(myclustermaster.novalocal,16000,1429530381377)-proc-coordinator-pool2-thread-1]
 procedure.ZKProcedureUtil: Clearing all znodes for procedure MyTargetTable 
including nodes /hbase-unsecure/flush-table-proc/acquired 
/hbase-unsecure/flush-table-proc/reached /hbase-unsecure/flush-table-proc/abort
{noformat}

> Flush procedure fails in trunk due to ZK issue
> ----------------------------------------------
>
>                 Key: HBASE-13217
>                 URL: https://issues.apache.org/jira/browse/HBASE-13217
>             Project: HBase
>          Issue Type: Bug
>            Reporter: ramkrishna.s.vasudevan
>            Assignee: Stephen Yuan Jiang
>
> Whenever I try to flush explicitly in the trunk code, the flush procedure 
> fails due to a ZK issue:
> {code}
> ERROR: org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable 
> via 
> stobdtserver3,16040,1426172670959:org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable:
>  java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at 
> org.apache.hadoop.hbase.errorhandling.ForeignExceptionDispatcher.rethrowException(ForeignExceptionDispatcher.java:83)
>         at 
> org.apache.hadoop.hbase.procedure.Procedure.isCompleted(Procedure.java:368)
>         at 
> org.apache.hadoop.hbase.procedure.flush.MasterFlushTableProcedureManager.isProcedureDone(MasterFlushTableProcedureManager.java:196)
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.isProcedureDone(MasterRpcServices.java:905)
>         at 
> org.apache.hadoop.hbase.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java:47019)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2073)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:107)
>         at 
> org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:130)
>         at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:107)
>         at java.lang.Thread.run(Thread.java:745)
> Caused by: 
> org.apache.hadoop.hbase.errorhandling.ForeignException$ProxyThrowable: 
> java.io.IOException: org.apache.zookeeper.KeeperException$NoNodeException: 
> KeeperErrorCode = NoNode for 
> /hbase/flush-table-proc/acquired/TestTable/stobdtserver3,16040,1426172670959
>         at 
> org.apache.hadoop.hbase.procedure.Subprocedure.cancel(Subprocedure.java:273)
>         at 
> org.apache.hadoop.hbase.procedure.ProcedureMember.controllerConnectionFailure(ProcedureMember.java:225)
>         at 
> org.apache.hadoop.hbase.procedure.ZKProcedureMemberRpcs.sendMemberAcquired(ZKProcedureMemberRpcs.java:254)
>         at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:166)
>         at 
> org.apache.hadoop.hbase.procedure.Subprocedure.call(Subprocedure.java:52)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         ... 1 more
> {code}
> Once this occurs, the RS becomes unusable even after it is restarted.  I have 
> verified that ZK remains intact and there is no problem with it.  A slightly 
> older version of trunk (3 months old) does not have this problem.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
