[
https://issues.apache.org/jira/browse/HBASE-8998?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Matteo Bertozzi updated HBASE-8998:
-----------------------------------
Attachment: HBASE-8998-v0.patch
I will not bet on the v0 patch as fix, but... from the logs I can see that
The master is starting the procedure and is waiting on the acquire RS znode
{code}
2013-07-18 23:07:53,478 DEBUG
[(vesta.apache.org,33937,1374188840534)-proc-coordinator-pool1-thread-1]
procedure.Procedure(245): Starting procedure 'snapshot_test1374188869207',
kicking off acquire phase on members.
2013-07-18 23:07:53,491 DEBUG
[(vesta.apache.org,33937,1374188840534)-proc-coordinator-pool1-thread-1]
procedure.ZKProcedureCoordinatorRpcs(102): Watching for acquire
node:/hbase/online-snapshot/acquired/snapshot_test1374188869207/vesta.apache.org,37857,1374188841025
{code}
then the RS is joining the acquire znode (and starts watching for the next step)
{code}
2013-07-18 23:07:53,498 DEBUG [member: 'vesta.apache.org,37857,1374188841025'
subprocedure-pool3-thread-1] procedure.ZKProcedureMemberRpcs(239): Member:
'vesta.apache.org,37857,1374188841025' joining acquired barrier for procedure
(snapshot_test1374188869207) in zk
2013-07-18 23:07:53,518 DEBUG [member: 'vesta.apache.org,54011,1374188841053'
subprocedure-pool2-thread-1] procedure.ZKProcedureMemberRpcs(247): Watch for
global barrier reached:/hbase/online-snapshot/reached/snapshot_test1374188869207
{code}
The master receives the notification... "Hey the RS is ready"
{code}
2013-07-18 23:07:53,532 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureCoordinatorRpcs$1(171): Node created:
/hbase/online-snapshot/acquired/snapshot_test1374188869207/vesta.apache.org,37857,1374188841025
2013-07-18 23:07:53,532 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(231): Current zk system:
2013-07-18 23:07:53,533 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(233): |-/hbase/online-snapshot
2013-07-18 23:07:53,534 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |-acquired
2013-07-18 23:07:53,535 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |----snapshot_test1374188869207
2013-07-18 23:07:53,536 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |-------vesta.apache.org,37857,1374188841025
2013-07-18 23:07:53,536 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |-------vesta.apache.org,54011,1374188841053
2013-07-18 23:07:53,537 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |-abort
2013-07-18 23:07:53,538 DEBUG [pool-1-thread-1-EventThread]
procedure.ZKProcedureUtil(250): |-reached
{code}
but at this point I don't see logs from the barrierAcquiredByMember()...
which lead me to the procedures.get() may return null...
and looking at the code there's a strange synchronization around the task start
with the task starting before the procedure is inserted in the list...
{code}
ZKProcedureMemberRpcs.java
public void nodeCreated(String path) {
LOG.debug("Node created: " + path);
logZKTree(this.baseZNode);
if (isAcquiredPathNode(path)) {
// node wasn't present when we created the watch so zk event triggers
acquire
coordinator.memberAcquiredBarrier(ZKUtil.getNodeName(ZKUtil.getParent(path)),
ZKUtil.getNodeName(path));
...
}
ProcedureCoordinator.java
void memberAcquiredBarrier(String procName, final String member) {
Procedure proc = procedures.get(procName);
if (proc != null) {
proc.barrierAcquiredByMember(member);
}
}
Procedure.java
public void barrierAcquiredByMember(String member) {
LOG.debug("member: '" + member + "' joining acquired barrier for procedure '"
+ procName + "' on coordinator");
...
{code}
> TestSnapshotCloneIndependence.testOnlineSnapshotMetadataChangesIndependent
> fails
> --------------------------------------------------------------------------------
>
> Key: HBASE-8998
> URL: https://issues.apache.org/jira/browse/HBASE-8998
> Project: HBase
> Issue Type: Bug
> Reporter: stack
> Attachments: HBASE-8998-v0.patch
>
>
> https://builds.apache.org/job/HBase-TRUNK/4270/testReport/org.apache.hadoop.hbase.client/TestSnapshotCloneIndependence/testOnlineSnapshotMetadataChangesIndependent/
> {code}
> Error Message
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException: Snapshot {
> ss=snapshot_test1374188869207 table=test1374188869207 type=FLUSH } had an
> error. Procedure snapshot_test1374188869207 {
> waiting=[vesta.apache.org,37857,1374188841025] done=[] }
> Stacktrace
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException:
> org.apache.hadoop.hbase.exceptions.HBaseSnapshotException: Snapshot {
> ss=snapshot_test1374188869207 table=test1374188869207 type=FLUSH } had an
> error. Procedure snapshot_test1374188869207 {
> waiting=[vesta.apache.org,37857,1374188841025] done=[] }
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
> at
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
> at
> org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:95)
> at
> org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:79)
> at
> org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:232)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.executeCallable(HBaseAdmin.java:2701)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.execute(HBaseAdmin.java:2670)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2304)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2251)
> at
> org.apache.hadoop.hbase.client.HBaseAdmin.snapshot(HBaseAdmin.java:2203)
> at
> org.apache.hadoop.hbase.snapshot.SnapshotTestingUtils.createSnapshotAndValidate(SnapshotTestingUtils.java:379)
> at
> org.apache.hadoop.hbase.snapshot.SnapshotTestingUtils.createSnapshotAndValidate(SnapshotTestingUtils.java:410)
> at
> org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence.runTestSnapshotMetadataChangesIndependent(TestSnapshotCloneIndependence.java:341)
> at
> org.apache.hadoop.hbase.client.TestSnapshotCloneIndependence.testOnlineSnapshotMetadataChangesIndependent(TestSnapshotCloneIndependence.java:145)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:597)
> at
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
> at
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
> at
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
> at
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
> at
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
> at
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
> at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:271)
> at
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:70)
> ...
> {code}
> Mighty [~mbertozzi]? Want to take a look at this failure? Thanks boss.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira