[
https://issues.apache.org/jira/browse/HBASE-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745644#comment-14745644
]
Heng Chen commented on HBASE-14362:
-----------------------------------
After analyzing the log, I found the following.
The test case fails because of this exception:
{code}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File
/test-logs/state-00000000000000000018.log could only be replicated to 2 nodes
instead of minReplication (=3). There are 3 datanode(s) running and 3 node(s)
are excluded in this operation.
{code}
There are many exceptions of this kind in the log, and they appear from the
very beginning of the log.
Most of them are caught in {{WALProcedureStore}}. The exception is the last
one, which is thrown by the method {{syncSlots}} once {{logRolled}} reaches
{{maxSyncFailureRoll}}:
{code}
private long syncSlots() throws Throwable {
  int retry = 0;
  int logRolled = 0;
  long totalSynced = 0;
  do {
    try {
      totalSynced = syncSlots(stream, slots, 0, slotIndex);
      break;
    } catch (Throwable e) {
      if (++retry >= maxRetriesBeforeRoll) {
        if (logRolled >= maxSyncFailureRoll) {
          LOG.error("Sync slots after log roll failed, abort.", e);
          sendAbortProcessSignal();
          // here the exception is thrown out, which causes the syncLoop to exit!!
          throw e;
        }
        if (!rollWriterOrDie()) {
          throw e;
        }
        logRolled++;
        retry = 0;
      }
    }
  } while (isRunning());
  return totalSynced;
}
{code}
So if I set {{hbase.procedure.store.wal.wait.before.roll}} and
{{hbase.procedure.store.wal.sync.failure.roll.max}} to smaller values,
the test case always fails.
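As a rough illustration of why smaller thresholds make the failure show up sooner, the control flow of the loop above can be modeled on its own. This is only a sketch, not the HBase code: the class name {{SyncRetrySketch}} and the method {{failuresUntilAbort}} are hypothetical, the sync call is replaced by a stand-in that always fails, and the log roll is assumed to succeed every time.

```java
import java.io.IOException;

// Simplified, stand-alone model of the retry/roll loop quoted above.
// It is NOT the HBase code: the stream, the log roll, and the abort signal
// are replaced by counters so the control flow can be run in isolation.
class SyncRetrySketch {

    /**
     * Returns how many consecutive sync failures it takes before the loop
     * gives up (the point where the real method rethrows).
     */
    static int failuresUntilAbort(int maxRetriesBeforeRoll, int maxSyncFailureRoll) {
        int retry = 0;
        int logRolled = 0;
        int failures = 0;
        while (true) {
            try {
                // Hypothetical stand-in for syncSlots(stream, ...): it always
                // fails, like a write that cannot reach minReplication.
                throw new IOException("could only be replicated to 2 nodes");
            } catch (IOException e) {
                failures++;
                if (++retry >= maxRetriesBeforeRoll) {
                    if (logRolled >= maxSyncFailureRoll) {
                        // The real method rethrows here, ending the sync loop.
                        return failures;
                    }
                    // Assumption: the log roll itself succeeds.
                    logRolled++;
                    retry = 0;
                }
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(failuresUntilAbort(3, 3)); // prints 12
        System.out.println(failuresUntilAbort(1, 0)); // prints 1
    }
}
```

Under these assumptions the loop gives up on consecutive failure number {{maxRetriesBeforeRoll * (maxSyncFailureRoll + 1)}}, so with the smallest settings a single failed sync is already enough to end the loop. The values 3 and 3 are illustrative only.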
To fix this, we could either increase these values when the test environment
is slow, or catch the exception.
> org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS is super
> duper flaky
> -----------------------------------------------------------------------------------------
>
> Key: HBASE-14362
> URL: https://issues.apache.org/jira/browse/HBASE-14362
> Project: HBase
> Issue Type: Bug
> Components: test
> Affects Versions: 2.0.0
> Reporter: Dima Spivak
> Priority: Critical
>
> [As seen in
> Jenkins|https://builds.apache.org/job/HBase-TRUNK/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master.procedure/TestWALProcedureStoreOnHDFS/history/],
> this test has been super flaky and we should probably address it.