[ 
https://issues.apache.org/jira/browse/HBASE-22193?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16813185#comment-16813185
 ] 

Duo Zhang commented on HBASE-22193:
-----------------------------------

Talked with [~zghaobac] offline. The problem here is not the number of retries 
but the retry interval.

When a region fails to open, we try to reassign it ASAP; the intention is to 
bring the region online quickly. But sometimes the region cannot come online 
on any RS because of a config error or some other problem. In that case it is 
not a good idea to retry immediately, as it produces a huge number of proc WALs.

So the first step is to detect this situation and increase the retry interval. 
As a long-term solution, I think we need to find a better way to deal with 
config errors. For now, the ModifyTableProcedure will hang forever, and the 
only way out is to use HBCK2 to bypass the procedure and fix the table state, 
which is a bit difficult. For HBase versions before 2.0, I think there is a 
straightforward fix: disable the table, fix the schema, and enable it again.
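
To make the "increase the retry interval" idea concrete, here is a minimal 
sketch of a capped exponential backoff. It is not the actual HBase 
implementation: the class name, method names, and the 1 second base / 60 second 
cap are all assumptions chosen for illustration.

```java
// Hypothetical sketch, not HBase code: capped exponential backoff for retries.
final class RetryBackoffSketch {

  /**
   * Delay to wait before retrying after the given 0-based attempt:
   * baseMillis doubled once per attempt, never exceeding maxMillis.
   */
  static long backoffMillis(int attempt, long baseMillis, long maxMillis) {
    if (attempt < 0 || baseMillis <= 0 || maxMillis < baseMillis) {
      throw new IllegalArgumentException("invalid backoff parameters");
    }
    long delay = baseMillis;
    for (int i = 0; i < attempt; i++) {
      // Stop doubling once the next step would pass the cap (also avoids overflow).
      if (delay > maxMillis / 2) {
        return maxMillis;
      }
      delay *= 2;
    }
    return delay;
  }

  /** Number of retries that fit into windowMillis when each retry waits backoffMillis. */
  static int retriesWithin(long windowMillis, long baseMillis, long maxMillis) {
    int attempts = 0;
    long elapsed = 0;
    while (true) {
      long delay = backoffMillis(attempts, baseMillis, maxMillis);
      if (elapsed + delay > windowMillis) {
        return attempts;
      }
      elapsed += delay;
      attempts++;
    }
  }

  public static void main(String[] args) {
    // Assumed values: 1 second base interval, 60 second cap, 1 hour window.
    for (int attempt = 0; attempt < 8; attempt++) {
      System.out.println("attempt " + attempt + " -> wait "
          + backoffMillis(attempt, 1_000L, 60_000L) + " ms");
    }
    System.out.println("retries in one hour: "
        + retriesWithin(3_600_000L, 1_000L, 60_000L));
  }
}
```

With a cap like this, a region that can never open produces a bounded number 
of retries per hour instead of retrying as fast as the master can schedule 
the procedure.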

> Reduce the default ASSIGN_MAX_ATTEMPTS config
> ---------------------------------------------
>
>                 Key: HBASE-22193
>                 URL: https://issues.apache.org/jira/browse/HBASE-22193
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Guanghao Zhang
>            Priority: Major
>
>  
> {code:java}
> public static final String ASSIGN_MAX_ATTEMPTS =
>     "hbase.assignment.maximum.attempts";
> private static final int DEFAULT_ASSIGN_MAX_ATTEMPTS = Integer.MAX_VALUE;
> {code}
> Now the default config is Integer.MAX_VALUE. 
>  
> {code:java}
> 2019-04-09,10:50:44,921 INFO 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure: 
> Retry=170813 of max=2147483647; pid=2849, ppid=2846, 
> state=RUNNABLE:REGION_STATE_TRANSITION_CONFIRM_OPENED, locked=true; 
> TransitRegionStateProcedure table=IntegrationTestBigLinkedList, 
> region=634feb79a583480597e1843647d11228, REOPEN/MOVE; rit=OPENING, 
> location=c4-hadoop-tst-st26.bj,29100,1554262369262
> {code}
> The ITBLL failed to open the region because of HBASE-22163 and retried 
> 170813 times to reopen it. After I fixed the problem and restarted the 
> master, I found that it took a long time to init the old procedure logs 
> because there were too many of them.
> The code in WALProcedureStore.java:
>  
> {code:java}
> private long initOldLogs(FileStatus[] logFiles) throws IOException {
>   if (logFiles == null || logFiles.length == 0) {
>     return 0L;
>   }
>   long maxLogId = 0;
>   for (int i = 0; i < logFiles.length; ++i) {
>     final Path logPath = logFiles[i].getPath();
>     leaseRecovery.recoverFileLease(fs, logPath);
>     if (!isRunning()) {
>       throw new IOException("wal aborting");
>     }
>     maxLogId = Math.max(maxLogId, getLogIdFromName(logPath.getName()));
>     ProcedureWALFile log = initOldLog(logFiles[i], this.walArchiveDir);
>     if (log != null) {
>       this.logs.add(log);
>     }
>   }
>   initTrackerFromOldLogs();
>   return maxLogId;
> }
> {code}
>  
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
