[ 
https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16940273#comment-16940273
 ] 

Duo Zhang commented on HBASE-23079:
-----------------------------------

{quote}
Duo Zhang you showed failed at SPLIT_TABLE_REGION_UPDATE_META, I think it was a 
failed step in rollback, and there is other steps failed logs,
{quote}

But anyway, we must fail at a step after SPLIT_TABLE_REGION_UPDATE_META right? 
It has already crossed the PONR, so by design it should not fail.

And still do not know what is a ORP timeout... You can see the logic in 
RegionRemoteProcedureBase.remoteCallFailed, it will not mark the procedure as 
failure, it just complete the procedure and let the parent procedure retry. So 
we need to find out where is the TimeoutIOException thrown.

> SCP fails when PE executes rollback of ORP 
> -------------------------------------------
>
>                 Key: HBASE-23079
>                 URL: https://issues.apache.org/jira/browse/HBASE-23079
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Xiaolin Ha
>            Assignee: Xiaolin Ha
>            Priority: Major
>
> In our test environment, we found that, when OpenRegionProcedure failed by 
> timeout, though it doesn't support rollback, ProcedureExcutor will delete the 
> OPR node in the store in its own rollback steps and make the RIT stay there 
> looping rollback exception. ServerCrashProcedure will detect this RIT but 
> cannot deal with it, because no node in the store and it will encounters NPE, 
> as a result, SCP aborts.
>  
> Logs are as follows,
> {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15] 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206, 
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true 
> found RIT pid=56043, ppid=55625, state=ROLLEDBACK; 
> TransitRegionStateProcedure table=c3srv_galaxy:tsdb, 
> region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN, 
> location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb, 
> region=c1ccd94593bf1b87269cec98d6ffaaae
> 2019-08-29,07:27:35,711 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.NullPointerException
>  at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
>  at 
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
>  at 
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
>  at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
>  at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
>  at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
>  at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,714 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, 
> meta=true:java.lang.NullPointerException; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
>  at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,716 ERROR [PEWorker-15] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS, 
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught 
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN, 
> locked=true; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, 
> meta=true:java.lang.NullPointerException; ServerCrashProcedure 
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled 
> state=SERVER_CRASH_SPLIT_LOGS
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
>  at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
>  at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
>  at 
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> {quote}
>  
> -------------
> {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
> hbase.4500.log:2019-08-29,01:40:31,103 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:31,107 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661, 
> ppid=56659, state=WAITING_TIMEOUT; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006, 
> timestamp=1567014012995
> hbase.4500.log:2019-08-29,01:40:31,186 INFO 
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster] 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach 
> pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE, 
> location=null, table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459 to restore RIT
> hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT 
> rollback...pid=56661, ppid=56659, state=FAILED, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true, 
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via 
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException: 
> Operation timed out after 19.6860sec; 
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7] 
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock 
> for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7] 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught 
> runtime exception for pid=56659, ppid=56653, 
> state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true; 
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service, 
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> {quote}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to