[ 
https://issues.apache.org/jira/browse/HBASE-14362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14745644#comment-14745644
 ] 

Heng Chen commented on HBASE-14362:
-----------------------------------

After analyzing the log, I found that the test case fails because of the 
following exception:
{code}
org.apache.hadoop.ipc.RemoteException(java.io.IOException): File 
/test-logs/state-00000000000000000018.log could only be replicated to 2 nodes 
instead of minReplication (=3).  There are 3 datanode(s) running and 3 node(s) 
are excluded in this operation.
{code}

This kind of exception occurs many times in the log, starting from the very 
beginning.

Most of these exceptions are caught in {{WALProcedureStore}}. The exception is 
the last one, which {{syncSlots}} rethrows once {{logRolled}} reaches 
{{maxSyncFailureRoll}}:

{code}
  private long syncSlots() throws Throwable {
    int retry = 0;
    int logRolled = 0;
    long totalSynced = 0;
    do {
      try {
        totalSynced = syncSlots(stream, slots, 0, slotIndex);
        break;
      } catch (Throwable e) {
        if (++retry >= maxRetriesBeforeRoll) {
          if (logRolled >= maxSyncFailureRoll) {
            LOG.error("Sync slots after log roll failed, abort.", e);
            sendAbortProcessSignal();
            // The exception is rethrown here, which causes syncLoop to exit.
            throw e;
          }

          if (!rollWriterOrDie()) {
            throw e;
          }

          logRolled++;
          retry = 0;
        }
      }
    } while (isRunning());
    return totalSynced;
  }
{code}
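
To make the abort condition concrete, here is a minimal standalone sketch (not the HBase code) that replays the same {{retry}} / {{logRolled}} counters against a sync that always fails. The class name {{SyncRetrySketch}}, the method {{failuresBeforeAbort}}, and the assumption that every log roll succeeds are hypothetical and only for illustration.

```java
public class SyncRetrySketch {

    /**
     * Counts how many consecutive sync failures the retry logic tolerates
     * before it gives up. Assumes every sync attempt fails and every log
     * roll succeeds (both are simplifications for this sketch).
     */
    static int failuresBeforeAbort(int maxRetriesBeforeRoll, int maxSyncFailureRoll) {
        int retry = 0;
        int logRolled = 0;
        int failures = 0;
        while (true) {
            failures++; // simulated failed sync attempt
            if (++retry >= maxRetriesBeforeRoll) {
                if (logRolled >= maxSyncFailureRoll) {
                    // This is the point where the original code rethrows.
                    return failures;
                }
                // Simulated successful log roll: reset the retry counter.
                logRolled++;
                retry = 0;
            }
        }
    }

    public static void main(String[] args) {
        // Hypothetical limits, chosen only to show the trend.
        System.out.println(failuresBeforeAbort(3, 3)); // prints 12
        System.out.println(failuresBeforeAbort(1, 0)); // prints 1
    }
}
```

Under these assumptions, and for {{maxRetriesBeforeRoll}} of at least 1, the loop tolerates {{maxRetriesBeforeRoll * (maxSyncFailureRoll + 1)}} failed attempts, so lowering either limit makes the abort much easier to hit.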

So if I set {{hbase.procedure.store.wal.wait.before.roll}} and 
{{hbase.procedure.store.wal.sync.failure.roll.max}} to smaller values, the 
test case fails every time.


To fix this, we could either increase these values when the test environment 
is slow, or catch the exception.
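
As one possible reading of the "catch the exception" option, the sketch below shows a loop that records a sync failure and stops cleanly instead of letting the exception escape the loop. It is standalone and does not use the HBase classes; {{SyncLoopSketch}}, {{Syncer}}, {{runLoop}}, and the accessors are hypothetical names.

```java
import java.io.IOException;
import java.util.concurrent.atomic.AtomicBoolean;

public class SyncLoopSketch {

    /** Hypothetical stand-in for the store's sync step. */
    interface Syncer {
        long sync() throws Exception;
    }

    private final AtomicBoolean running = new AtomicBoolean(true);
    private volatile Throwable lastFailure;

    /**
     * Calls the sync step up to maxIterations times. A failure is caught,
     * remembered, and turned into a clean stop rather than propagating out
     * of the loop. Returns the number of successful syncs.
     */
    int runLoop(Syncer syncer, int maxIterations) {
        int succeeded = 0;
        for (int i = 0; i < maxIterations && running.get(); i++) {
            try {
                syncer.sync();
                succeeded++;
            } catch (Throwable t) {
                lastFailure = t;      // keep the cause for later inspection
                running.set(false);   // stop the loop deliberately
            }
        }
        return succeeded;
    }

    boolean isRunning() {
        return running.get();
    }

    Throwable getLastFailure() {
        return lastFailure;
    }

    public static void main(String[] args) {
        SyncLoopSketch loop = new SyncLoopSketch();
        int[] calls = {0};
        // The third call fails, simulating a sync that cannot be replicated.
        int ok = loop.runLoop(() -> {
            if (++calls[0] == 3) {
                throw new IOException("simulated sync failure");
            }
            return 1L;
        }, 10);
        System.out.println(ok + " successful syncs, running=" + loop.isRunning());
    }
}
```

Whether the test should then treat the recorded failure as a pass or a failure is a separate decision that this sketch does not make.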



> org.apache.hadoop.hbase.master.procedure.TestWALProcedureStoreOnHDFS is super 
> duper flaky
> -----------------------------------------------------------------------------------------
>
>                 Key: HBASE-14362
>                 URL: https://issues.apache.org/jira/browse/HBASE-14362
>             Project: HBase
>          Issue Type: Bug
>          Components: test
>    Affects Versions: 2.0.0
>            Reporter: Dima Spivak
>            Priority: Critical
>
> [As seen in 
> Jenkins|https://builds.apache.org/job/HBase-TRUNK/lastCompletedBuild/testReport/org.apache.hadoop.hbase.master.procedure/TestWALProcedureStoreOnHDFS/history/],
>  this test has been super flaky and we should probably address it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
