[
https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938310#comment-16938310
]
Duo Zhang commented on HBASE-23079:
-----------------------------------
I think the problem here is that why OpenRegionProcedure could fail with a
TimeoutIOException... By design it will not 'fail'...
> SCP fails when PE executes rollback of ORP
> -------------------------------------------
>
> Key: HBASE-23079
> URL: https://issues.apache.org/jira/browse/HBASE-23079
> Project: HBase
> Issue Type: Bug
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
>
> In our test environment, we found that, when OpenRegionProcedure failed by
> timeout, though it doesn't support rollback, ProcedureExcutor will delete the
> OPR node in the store in its own rollback steps and make the RIT stay there
> looping rollback exception. ServerCrashProcedure will detect this RIT but
> cannot deal with it, because no node in the store and it will encounters NPE,
> as a result, SCP aborts.
>
> Logs are as follows,
> {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15]
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206,
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> found RIT pid=56043, ppid=55625, state=ROLLEDBACK;
> TransitRegionStateProcedure table=c3srv_galaxy:tsdb,
> region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN,
> location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb,
> region=c1ccd94593bf1b87269cec98d6ffaaae
> 2019-08-29,07:27:35,711 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
> at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,714 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true,
> meta=true:java.lang.NullPointerException; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,716 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true,
> meta=true:java.lang.NullPointerException; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled
> state=SERVER_CRASH_SPLIT_LOGS
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> {quote}
>
> -------------
> {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
> hbase.4500.log:2019-08-29,01:40:31,103 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:31,107 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661,
> ppid=56659, state=WAITING_TIMEOUT;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006,
> timestamp=1567014012995
> hbase.4500.log:2019-08-29,01:40:31,186 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach
> pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE,
> location=null, table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459 to restore RIT
> hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT
> rollback...pid=56661, ppid=56659, state=FAILED,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> {quote}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)