[ 
https://issues.apache.org/jira/browse/HBASE-19976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16361685#comment-16361685
 ] 

Duo Zhang commented on HBASE-19976:
-----------------------------------

It seems I added the thread dump in the wrong place, so there is no thread 
dump on failure...

Anyway, see here

https://builds.apache.org/job/HBASE-Flaky-Tests/25832/artifact/hbase-server/target/surefire-reports/org.apache.hadoop.hbase.master.TestDLSFSHLog-output.txt/*view*/

{noformat}
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-1(pid=139) run time 31.6840sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-2(pid=146) run time 29.5870sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-3(pid=150) run time 29.5880sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-4(pid=142) run time 31.6870sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-5(pid=138) run time 31.6830sec
2018-02-12 04:56:54,563 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-6(pid=140) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-7(pid=141) run time 31.6880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-8(pid=143) run time 31.6890sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-9(pid=137) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-10(pid=136) run time 31.6840sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-11(pid=149) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-12(pid=148) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-13(pid=144) run time 29.5870sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-14(pid=145) run time 29.5870sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-15(pid=147) run time 29.5880sec
2018-02-12 04:56:54,564 WARN  [ProcExecTimeout] 
procedure2.ProcedureExecutor$WorkerMonitor(1985): Worker stuck 
PEWorker-16(pid=151) run time 29.5890sec
{noformat}

All 16 workers are stuck. Let's check which procedures they are running.
{noformat}
2018-02-12 04:56:22,879 INFO  [PEWorker-1] 
procedure.MasterProcedureScheduler(883): pid=139, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=d36808157b0edc272844a07587e3630e testThreeRSAbort 
testThreeRSAbort,o@\x17\xAB\xCE,1518411364183.d36808157b0edc272844a07587e3630e.
2018-02-12 04:56:24,976 INFO  [PEWorker-2] 
procedure.MasterProcedureScheduler(883): pid=146, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=7726f3d31204e2e60fc38582fefddfdb testThreeRSAbort 
testThreeRSAbort,f\xAA\x08Y),1518411364183.7726f3d31204e2e60fc38582fefddfdb.
2018-02-12 04:56:24,975 INFO  [PEWorker-3] 
procedure.MasterProcedureScheduler(883): pid=150, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=693d6cddbd1f127dd087ee20def3f081 testThreeRSAbort 
testThreeRSAbort,s6\x94\xE5\xA4,1518411364183.693d6cddbd1f127dd087ee20def3f081.
2018-02-12 04:56:22,880 INFO  [PEWorker-4] 
procedure.MasterProcedureScheduler(883): pid=142, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=0d0f98adccc5c3430f2981524a9cdd12 testThreeRSAbort 
testThreeRSAbort,u\xDA\xE8a\x88,1518411364183.0d0f98adccc5c3430f2981524a9cdd12.
2018-02-12 04:56:22,880 INFO  [PEWorker-5] 
procedure.MasterProcedureScheduler(883): pid=138, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=15f6ffa81a9f6f469f917050199b8a8c testThreeRSAbort 
testThreeRSAbort,g\xFC2\x17\x1B,1518411364183.15f6ffa81a9f6f469f917050199b8a8c.
2018-02-12 04:56:22,880 INFO  [PEWorker-6] 
procedure.MasterProcedureScheduler(883): pid=140, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=82b1ce3cc98b041a73162d359366df5d testThreeRSAbort 
testThreeRSAbort,p\x92Ai\xC0,1518411364183.82b1ce3cc98b041a73162d359366df5d.
2018-02-12 04:56:22,881 INFO  [PEWorker-7] 
procedure.MasterProcedureScheduler(883): pid=141, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=4e0473ab4c62db5b9372f9434e7a9ec1 testThreeRSAbort 
testThreeRSAbort,r\x8D\x80\x06\xAB,1518411364183.4e0473ab4c62db5b9372f9434e7a9ec1.
2018-02-12 04:56:22,879 INFO  [PEWorker-8] 
procedure.MasterProcedureScheduler(883): pid=143, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=09760f994679eafb83ec28e5b8d61944 testThreeRSAbort 
testThreeRSAbort,w-\x12\x1Fz,1518411364183.09760f994679eafb83ec28e5b8d61944.
2018-02-12 04:56:22,881 INFO  [PEWorker-9] 
procedure.MasterProcedureScheduler(883): pid=137, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=a5c3a01d1ecf966a7f6f90cc955a4fe1 testThreeRSAbort 
testThreeRSAbort,b\x0Av@Z,1518411364183.a5c3a01d1ecf966a7f6f90cc955a4fe1.
2018-02-12 04:56:22,880 INFO  [PEWorker-10] 
procedure.MasterProcedureScheduler(883): pid=136, ppid=130, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=af34788a32b503977634237b3d0394b5 testThreeRSAbort 
testThreeRSAbort,aaaaa,1518411364183.af34788a32b503977634237b3d0394b5.
2018-02-12 04:56:24,976 INFO  [PEWorker-11] 
procedure.MasterProcedureScheduler(883): pid=149, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=07e5494382a7546646c1f27367f2c264 testThreeRSAbort 
testThreeRSAbort,q;VH\xB9,1518411364183.07e5494382a7546646c1f27367f2c264
2018-02-12 04:56:24,976 INFO  [PEWorker-12] 
procedure.MasterProcedureScheduler(883): pid=148, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=aa78c0bb206f32bd6f0d444dd6f6b822 testThreeRSAbort 
testThreeRSAbort,kI\x9Aq\xF8,1518411364183.aa78c0bb206f32bd6f0d444dd6f6b822.
2018-02-12 04:56:24,977 INFO  [PEWorker-13] 
procedure.MasterProcedureScheduler(883): pid=144, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=2f1e68b3fdba85d5092d6908a1752795 testThreeRSAbort 
testThreeRSAbort,c\x5C\x9F\xFEL,1518411364183.2f1e68b3fdba85d5092d6908a1752795.
2018-02-12 04:56:24,977 INFO  [PEWorker-14] 
procedure.MasterProcedureScheduler(883): pid=145, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=c5994de9d0ebc552c8a95270835a819f testThreeRSAbort 
testThreeRSAbort,f\x00\xF3z0,1518411364183.c5994de9d0ebc552c8a95270835a819f.
2018-02-12 04:56:24,976 INFO  [PEWorker-15] 
procedure.MasterProcedureScheduler(883): pid=147, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=f9089be5ddd1684a17f7d3ce3714c5ca testThreeRSAbort 
testThreeRSAbort,i\xF7p\xB4\x06,1518411364183.f9089be5ddd1684a17f7d3ce3714c5ca.
2018-02-12 04:56:24,975 INFO  [PEWorker-16] 
procedure.MasterProcedureScheduler(883): pid=151, ppid=131, 
state=RUNNABLE:REGION_TRANSITION_QUEUE; AssignProcedure table=testThreeRSAbort, 
region=4b9f7f49b787a4d83fcccc2a00ae8639 testThreeRSAbort 
testThreeRSAbort,y(P\xBCe,1518411364183.4b9f7f49b787a4d83fcccc2a00ae8639.
{noformat}

All of them are AssignProcedures, and no doubt they will be stuck since the 
meta region is offline...
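
To make the failure mode concrete, here is a minimal sketch of the same 
pattern outside HBase. It is not the ProcedureExecutor code: a plain 
java.util.concurrent fixed thread pool stands in for the PEWorker threads, a 
CountDownLatch stands in for "meta is online", and the class and method names 
are made up for illustration. Every worker picks up a task that waits for 
meta, so the one task that would bring meta online never gets a thread.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for the procedure executor; not HBase code.
class WorkerExhaustionSketch {

  /**
   * Queues `assigns` tasks that block until "meta" is online, then one
   * recovery task that brings it online. Returns true if the recovery task
   * got to run within timeoutMs.
   */
  static boolean recoveryRan(int workers, int assigns, long timeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    CountDownLatch metaOnline = new CountDownLatch(1);
    try {
      // Stand-ins for AssignProcedure: each one occupies a worker thread
      // while it waits for meta.
      for (int i = 0; i < assigns; i++) {
        pool.execute(() -> {
          try {
            metaOnline.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Stand-in for RecoverMetaProcedure: it only runs if a worker is free.
      pool.execute(metaOnline::countDown);
      return metaOnline.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Interrupt the blocked workers so the JVM can exit.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    // Same shape as the log above: 16 workers, 16 waiting assigns.
    System.out.println("16 workers: recovery ran = " + recoveryRan(16, 16, 500));
    // One spare worker is enough for the recovery task to get through.
    System.out.println("17 workers: recovery ran = " + recoveryRan(17, 16, 500));
  }
}
```

With 16 workers the recovery task should stay queued until the timeout, and 
with one spare worker it should run right away. The sketch only shows why 
exhausting the workers is fatal here; it does not suggest how the real 
executor should be fixed.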

Thanks.

> Dead lock if the worker threads in procedure executor are exhausted
> -------------------------------------------------------------------
>
>                 Key: HBASE-19976
>                 URL: https://issues.apache.org/jira/browse/HBASE-19976
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>
> See the comments in HBASE-19554. If all the worker threads are stuck in 
> AssignProcedure since the meta region is offline, then the 
> RecoverMetaProcedure cannot be executed, which causes a deadlock.



