[
https://issues.apache.org/jira/browse/HBASE-23079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16938326#comment-16938326
]
Xiaolin Ha edited comment on HBASE-23079 at 9/26/19 7:19 AM:
-------------------------------------------------------------
OK, I'll dig more.
was (Author: xiaolin ha):
OK, I'll dig more. By the way, is this log strange? It is waiting confirm_open
but location is null.
{quote}{{2019-08-29,01:40:31,106 INFO
[master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56410,
ppid=56043, state=WAITING_TIMEOUT;
org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1005,
timestamp=1567014027020}}
{{2019-08-29,01:40:31,186 INFO
[master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach pid=56043,
ppid=55625, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
TransitRegionStateProcedure table=c3srv_galaxy:tsdb,
region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN to rit=OFFLINE, location=null,
table=c3srv_galaxy:tsdb, region=c1ccd94593bf1b87269cec98d6ffaaae to restore
RIT}}
{quote}
> SCP fails when PE executes rollback of ORP
> -------------------------------------------
>
> Key: HBASE-23079
> URL: https://issues.apache.org/jira/browse/HBASE-23079
> Project: HBase
> Issue Type: Bug
> Reporter: Xiaolin Ha
> Assignee: Xiaolin Ha
> Priority: Major
>
> In our test environment, we found that, when OpenRegionProcedure failed by
> timeout, though it doesn't support rollback, ProcedureExcutor will delete the
> OPR node in the store in its own rollback steps and make the RIT stay there
> looping rollback exception. ServerCrashProcedure will detect this RIT but
> cannot deal with it, because no node in the store and it will encounters NPE,
> as a result, SCP aborts.
>
> Logs are as follows,
> {quote}2019-08-29,07:27:35,710 INFO [PEWorker-15]
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure: pid=70206,
> state=RUNNABLE:SERVER_CRASH_ASSIGN, locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> found RIT pid=56043, ppid=55625, state=ROLLEDBACK;
> TransitRegionStateProcedure table=c3srv_galaxy:tsdb,
> region=c1ccd94593bf1b87269cec98d6ffaaae, ASSIGN; rit=OPEN,
> location=c3-hadoop-srv-st297.bj,21600,1567012736142, table=c3srv_galaxy:tsdb,
> region=c1ccd94593bf1b87269cec98d6ffaaae
> 2019-08-29,07:27:35,711 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:140)
> at
> org.apache.hadoop.hbase.procedure2.store.ProcedureStoreTracker.update(ProcedureStoreTracker.java:133)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.updateStoreTracker(WALProcedureStore.java:782)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.pushData(WALProcedureStore.java:737)
> at
> org.apache.hadoop.hbase.procedure2.store.wal.WALProcedureStore.update(WALProcedureStore.java:604)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.persistAndWake(RegionRemoteProcedureBase.java:182)
> at
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.serverCrashed(RegionRemoteProcedureBase.java:239)
> at
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.serverCrashed(TransitRegionStateProcedure.java:398)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:461)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:221)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
> at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1645)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1392)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,714 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_SPLIT_LOGS,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true,
> meta=true:java.lang.NullPointerException; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled state=SERVER_CRASH_ASSIGN
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1566)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeRollback(ProcedureExecutor.java:1498)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1349)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1100(ProcedureExecutor.java:78)
> at
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1962)
> 2019-08-29,07:27:35,716 ERROR [PEWorker-15]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=70206, state=FAILED:SERVER_CRASH_GET_REGIONS,
> locked=true, exception=java.lang.NullPointerException via CODE-BUG: Uncaught
> runtime exception: pid=70206, state=RUNNABLE:SERVER_CRASH_ASSIGN,
> locked=true; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true,
> meta=true:java.lang.NullPointerException; ServerCrashProcedure
> server=c3-hadoop-srv-st297.bj,21600,1567012736142, splitWal=true, meta=true
> java.lang.UnsupportedOperationException: unhandled
> state=SERVER_CRASH_SPLIT_LOGS
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:333)
> at
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.rollbackState(ServerCrashProcedure.java:64)
> at
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.rollback(StateMachineProcedure.java:208)
> at
> org.apache.hadoop.hbase.procedure2.Procedure.doRollback(Procedure.java:979)
> {quote}
>
> -------------
> {quote}[work@c3-hadoop-srv-ct30 log]$ grep 'pid=56659' *
> hbase.4500.log:2019-08-29,01:40:31,103 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:31,107 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.procedure2.TimeoutExecutorThread: ADDED pid=56661,
> ppid=56659, state=WAITING_TIMEOUT;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure; timeout=1006,
> timestamp=1567014012995
> hbase.4500.log:2019-08-29,01:40:31,186 INFO
> [master/c3-hadoop-srv-ct30:21500:becomeActiveMaster]
> org.apache.hadoop.hbase.master.assignment.AssignmentManager: Attach
> pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_CONFIRM_OPENED;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN to rit=OFFLINE,
> location=null, table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459 to restore RIT
> hbase.4500.log:2019-08-29,01:40:31,676 INFO [PEWorker-16]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: LOCK_EVENT_WAIT
> rollback...pid=56661, ppid=56659, state=FAILED,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,155 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:43,759 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56661, ppid=56659, state=FAILED, locked=true,
> exception=org.apache.hadoop.hbase.exceptions.TimeoutIOException via
> ProcedureExecutor:org.apache.hadoop.hbase.exceptions.TimeoutIOException:
> Operation timed out after 19.6860sec;
> org.apache.hadoop.hbase.master.assignment.OpenRegionProcedure
> hbase.4500.log:2019-08-29,01:40:44,064 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_OPEN, locked=true;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,471 INFO [PEWorker-7]
> org.apache.hadoop.hbase.master.procedure.MasterProcedureScheduler: Took xlock
> for pid=56659, ppid=56653, state=WAITING:REGION_STATE_TRANSITION_OPEN;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> hbase.4500.log:2019-08-29,01:40:44,573 ERROR [PEWorker-7]
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor: CODE-BUG: Uncaught
> runtime exception for pid=56659, ppid=56653,
> state=WAITING:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
> TransitRegionStateProcedure table=c3prck8smonitorc3:tsdb-service,
> region=413a5199ff68f3505f490c8341efe459, ASSIGN
> {quote}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)