[ 
https://issues.apache.org/jira/browse/HBASE-19554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16360455#comment-16360455
 ] 

Duo Zhang commented on HBASE-19554:
-----------------------------------

After revisiting the log, I think I found a possible deadlock in recovery.

It is very simple: we only have 16 threads to run the procedures. During a 
failover, each region leads to one AssignProcedure, and if the meta region is 
also offline, every AssignProcedure gets stuck at 
AssignProcedure.startTransition, where we need to read something from the meta 
region. So if the RecoverMetaProcedure is not picked up by one of the worker 
threads before all of them are stuck at AssignProcedure.startTransition, we 
have a deadlock.
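
The starvation described above can be reproduced outside HBase with a plain 
fixed-size thread pool. This is only a minimal sketch, not the 
ProcedureExecutor code: the class and method names are hypothetical, a 
CountDownLatch stands in for "meta is online", and the blocking await() stands 
in for the meta read in AssignProcedure.startTransition.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical demo class; none of these names come from HBase.
final class WorkerStarvationDemo {

  /**
   * Submits assignTasks blocking tasks and then one recovery task to a pool
   * of the given size. Returns true if the recovery task ran within
   * timeoutMs, false if it was starved.
   */
  static boolean recoveryRuns(int workers, int assignTasks, long timeoutMs) {
    ExecutorService pool = Executors.newFixedThreadPool(workers);
    CountDownLatch metaOnline = new CountDownLatch(1);
    try {
      for (int i = 0; i < assignTasks; i++) {
        pool.execute(() -> {
          try {
            // Assumption: stands in for the meta read in startTransition.
            metaOnline.await();
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        });
      }
      // Assumption: stands in for RecoverMetaProcedure. It is submitted to
      // the same pool, after the assign tasks.
      pool.execute(metaOnline::countDown);
      return metaOnline.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      // Interrupt any blocked workers so the demo always terminates.
      pool.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("16 workers, 16 assigns: " + recoveryRuns(16, 16, 500));
    System.out.println("16 workers, 15 assigns: " + recoveryRuns(16, 15, 500));
  }
}
```

With as many blocking tasks as workers, the recovery task never gets a thread 
and the call returns false; with one worker left free, it returns true.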

So I think we should have a special worker thread which can only execute 
meta-related operations. If a TableProcedureInterface is for the meta table, we 
should insert it into a special queue and let the special worker run it 
immediately.
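
The idea above could look roughly like the following. Again this is only a 
sketch under assumptions, not a patch: MetaPriorityScheduler, submit and the 
forMetaTable flag are hypothetical names, and a single-thread executor plays 
the role of the special queue plus special worker.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical scheduler; none of these names come from HBase.
final class MetaPriorityScheduler {
  private final ExecutorService generalWorkers;
  // Dedicated worker with its own queue, reserved for meta procedures.
  private final ExecutorService metaWorker = Executors.newSingleThreadExecutor();

  MetaPriorityScheduler(int workers) {
    this.generalWorkers = Executors.newFixedThreadPool(workers);
  }

  /** Routes meta-table procedures to the dedicated worker. */
  void submit(Runnable procedure, boolean forMetaTable) {
    (forMetaTable ? metaWorker : generalWorkers).execute(procedure);
  }

  void shutdownNow() {
    generalWorkers.shutdownNow();
    metaWorker.shutdownNow();
  }

  /**
   * Saturates the general workers with tasks that wait for meta, then
   * submits the recovery task as a meta procedure. Returns true if the
   * recovery task ran within timeoutMs.
   */
  static boolean recoveryRuns(int workers, int assignTasks, long timeoutMs) {
    MetaPriorityScheduler scheduler = new MetaPriorityScheduler(workers);
    CountDownLatch metaOnline = new CountDownLatch(1);
    try {
      for (int i = 0; i < assignTasks; i++) {
        scheduler.submit(() -> {
          try {
            metaOnline.await(); // blocks until meta is "recovered"
          } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
          }
        }, false);
      }
      scheduler.submit(metaOnline::countDown, true);
      return metaOnline.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      return false;
    } finally {
      scheduler.shutdownNow();
    }
  }

  public static void main(String[] args) {
    System.out.println("16 workers, 32 assigns: " + recoveryRuns(16, 32, 2000));
  }
}
```

Because the meta procedure never competes with the assign tasks for a thread, 
it runs even when every general worker is blocked waiting for meta.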

Let me open a new issue to address it.

Thanks.

> AbstractTestDLS.testThreeRSAbort sometimes fails in pre commit
> --------------------------------------------------------------
>
>                 Key: HBASE-19554
>                 URL: https://issues.apache.org/jira/browse/HBASE-19554
>             Project: HBase
>          Issue Type: Sub-task
>          Components: Recovery, wal
>            Reporter: Duo Zhang
>            Assignee: Duo Zhang
>            Priority: Major
>             Fix For: 2.0.0-beta-2
>
>         Attachments: HBASE-19554-thread-dump.patch, HBASE-19554.patch
>
>
> https://builds.apache.org/job/PreCommit-HBASE-Build/10554/artifact/patchprocess/patch-unit-hbase-server.txt
> The error message is a bit strange:
> {quote}
> [ERROR] testThreeRSAbort(org.apache.hadoop.hbase.master.TestDLSAsyncFSWAL) 
> Time elapsed: 20.627 s <<< ERROR!
> org.apache.hadoop.hbase.TableNotFoundException: Region of 
> 'hbase:namespace,,1513320505933.451650152885a3b41d0b1110deca513c.' is 
> expected in the table of 'testThreeRSAbort', but hbase:meta says it is in the 
> table of 'hbase:namespace'. hbase:meta might be damaged.
> {quote}
> It fails for both FSHLog and AsyncFSWAL. Need to dig more.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
