[ 
https://issues.apache.org/jira/browse/HBASE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Matteo Bertozzi updated HBASE-8998:
-----------------------------------

    Attachment: HBASE-8998-v0.patch

I will not bet on the v0 patch as fix, but... from the logs I can see that

The master is starting the procedure and is waiting on the acquire RS znode
{code}
2013-07-18 23:07:53,478 DEBUG 
[(vesta.apache.org,33937,1374188840534)-proc-coordinator-pool1-thread-1] 
procedure.Procedure(245): Starting procedure 'snapshot_test1374188869207', 
kicking off acquire phase on members.
2013-07-18 23:07:53,491 DEBUG 
[(vesta.apache.org,33937,1374188840534)-proc-coordinator-pool1-thread-1] 
procedure.ZKProcedureCoordinatorRpcs(102): Watching for acquire 
node:/hbase/online-snapshot/acquired/snapshot_test1374188869207/vesta.apache.org,37857,1374188841025
{code}

then the RS is joining the acquire znode (and starts watching for the next step)
{code}
2013-07-18 23:07:53,498 DEBUG [member: 'vesta.apache.org,37857,1374188841025' 
subprocedure-pool3-thread-1] procedure.ZKProcedureMemberRpcs(239): Member: 
'vesta.apache.org,37857,1374188841025' joining acquired barrier for procedure 
(snapshot_test1374188869207) in zk
2013-07-18 23:07:53,518 DEBUG [member: 'vesta.apache.org,54011,1374188841053' 
subprocedure-pool2-thread-1] procedure.ZKProcedureMemberRpcs(247): Watch for 
global barrier reached:/hbase/online-snapshot/reached/snapshot_test1374188869207
{code}

The master receives the notification... "Hey the RS is ready"
{code}
2013-07-18 23:07:53,532 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureCoordinatorRpcs$1(171): Node created: 
/hbase/online-snapshot/acquired/snapshot_test1374188869207/vesta.apache.org,37857,1374188841025
2013-07-18 23:07:53,532 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(231): Current zk system:
2013-07-18 23:07:53,533 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(233): |-/hbase/online-snapshot
2013-07-18 23:07:53,534 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |-acquired
2013-07-18 23:07:53,535 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |----snapshot_test1374188869207
2013-07-18 23:07:53,536 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |-------vesta.apache.org,37857,1374188841025
2013-07-18 23:07:53,536 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |-------vesta.apache.org,54011,1374188841053
2013-07-18 23:07:53,537 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |-abort
2013-07-18 23:07:53,538 DEBUG [pool-1-thread-1-EventThread] 
procedure.ZKProcedureUtil(250): |-reached
{code}

but at this point I don't see logs from the barrierAcquiredByMember()...
which lead me to the procedures.get() may return null...
and looking at the code there's a strange synchronization around the task start
with the task starting before the procedure is inserted in the list...
{code}
ZKProcedureMemberRpcs.java
public void nodeCreated(String path) {
  LOG.debug("Node created: " + path);
  logZKTree(this.baseZNode);
  if (isAcquiredPathNode(path)) {
    // node wasn't present when we created the watch so zk event triggers 
acquire
    
coordinator.memberAcquiredBarrier(ZKUtil.getNodeName(ZKUtil.getParent(path)),
      ZKUtil.getNodeName(path));
  ...
}

ProcedureCoordinator.java
void memberAcquiredBarrier(String procName, final String member) {
  Procedure proc = procedures.get(procName);
  if (proc != null) {
    proc.barrierAcquiredByMember(member);
  }
}

Procedure.java
public void barrierAcquiredByMember(String member) {
  LOG.debug("member: '" + member + "' joining acquired barrier for procedure '" 
+ procName + "' on coordinator");
  ...
{code}
                
> TestSnapshotCloneIndependence.testOnlineSnapshotMetadataChangesIndependent 
> fails
> --------------------------------------------------------------------------------
>
>                 Key: HBASE-8998
>                 URL: https://issues.apache.org/jira/browse/HBASE-8998
>             Project: HBase
>          Issue Type: Bug
>            Reporter: stack
>         Attachments: HBASE-8998-v0.patch
>
>
> https://builds.apache.org/job/HBase-TRUNK/4270/testReport/org.apache.hadoop.hbase.client/TestSnapshotCloneIndependence/testOnlineSnapshotMetadataChangesIndependent/
> {code}
> Error Message
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException: Snapshot { 
> ss=snapshot_test1374188869207 table=test1374188869207 type=FLUSH } had an 
> error.  Procedure snapshot_test1374188869207 { 
> waiting=[vesta.apache.org,37857,1374188841025] done=[] }
> Stacktrace
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException: 
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException: Snapshot { 
> ss=snapshot_test1374188869207 table=test1374188869207 type=FLUSH } had an 
> error.  Procedure snapshot_test1374188869207 { 
> waiting=[vesta.apache.org,37857,1374188841025] done=[] }
>       at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>       at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>       at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>       at 
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
>       at 
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
>       at 
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:232)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:2701)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2670)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2304)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2251)
>       at 
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2203)
>       at 
> org.apache.hadoop.hbase.snapshot.SnapshotTestingUtils.createSnapshotAndValidate(SnapshotTestingUtils.java:379)
>       at 
> org.apache.hadoop.hbase.snapshot.SnapshotTestingUtils.createSnapshotAndValidate(SnapshotTestingUtils.java:410)
>       at 
> org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence.runTestSnapshotMetadataChangesIndependent(TestSnapshotCloneIndependence.java:341)
>       at 
> org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence.testOnlineSnapshotMetadataChangesIndependent(TestSnapshotCloneIndependence.java:145)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>       at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>       at java.lang.reflect.Method.invoke(Method.java:597)
>       at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
>       at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>       at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
>       at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>       at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
>       at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
>       at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
>       at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> ...
> {code}
> Mighty [~mbertozzi]?  Want to take a look at this failure?  Thanks boss.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Reply via email to