[ 
https://issues.apache.org/jira/browse/HBASE-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673629#comment-13673629
 ] 

Jeffrey Zhong commented on HBASE-8666:
--------------------------------------

{quote}
How can this happen? Is this okay? May be because log replaying/splitting of 
the RS which was hosting meta is clear by the time master was initialized (and 
therefore getFailedServersFromLogFolders() couldn't find it?)
{quote}
That might be the case, or it may be because I killed the Master right before 
it could delete the recovering META region from ZK while the same server still 
had non-meta WALs to be replayed, so HMaster#removeStaleRecoveringRegionsFromZK 
couldn't delete the stale META region.

{quote}
 Right. That's why I commented that would lastRecoveringNodeCreationTime be 
reset to 0 instead of Long.MAXIMUM in removeRecoveringRegionsFromZK once all 
such znodes are cleared?
{quote}
I see your point now. This is a good question. Below is the related code 
{code}
        // no splitting work items left
        deleteRecoveringRegionZNodes(null);
        // reset lastRecoveringNodeCreationTime because we cleared all
        // recovering znodes at this point.
        lastRecoveringNodeCreationTime = Long.MAX_VALUE;
{code}
The reason I set lastRecoveringNodeCreationTime=Long.MAX_VALUE is that after 
the deleteRecoveringRegionZNodes call, all recovering regions are removed. So 
there is no need to run the GC process every second, especially since it adds 
read traffic to ZK when there is nothing to be GCed.
We only need the GC again once new split work is scheduled, at which point 
lastRecoveringNodeCreationTime is set to the current time again.

lastRecoveringNodeCreationTime is initialized to 0 because during master 
startup the previous state is gone and we don't know whether we need to run GC 
or not. So we just trigger it once after master initialization in case we have 
orphan items to be cleared. After the GC has run once and 
lastRecoveringNodeCreationTime=Long.MAX_VALUE, the GC stops until new 
recovery work items come in.
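
To make the lifecycle of the timestamp concrete, here is a minimal standalone 
sketch. It is not the code from the patch: the class name, the 15 second 
threshold, the helper methods and the run counter are all made up for 
illustration, and it leaves out any other conditions the real check may have 
(such as no split tasks being outstanding). It only shows why 0 forces one GC 
pass after startup and why Long.MAX_VALUE switches the GC off until new split 
work is scheduled.

```java
public class RecoveringZnodeGcSketch {
    // Hypothetical threshold; the real value and its config key are not shown here.
    static final long CHECK_THRESHOLD_MS = 15_000L;

    // 0 at startup: previous state is unknown, so the first check always passes.
    private long lastRecoveringNodeCreationTime = 0L;

    // Counts GC passes; stands in for the actual znode cleanup.
    private int gcRuns = 0;

    // With Long.MAX_VALUE the difference is a large negative number,
    // so it can never exceed the threshold and the GC stays idle.
    boolean shouldRunGc(long nowMs) {
        return nowMs - lastRecoveringNodeCreationTime > CHECK_THRESHOLD_MS;
    }

    // New split work re-arms the GC by recording the current time.
    void onSplitWorkScheduled(long nowMs) {
        lastRecoveringNodeCreationTime = nowMs;
    }

    // One tick of the periodic check.
    void periodicCheck(long nowMs) {
        if (shouldRunGc(nowMs)) {
            gcRuns++; // stand-in for deleteRecoveringRegionZNodes(null)
            // Everything was cleared, so disable the GC until new work arrives.
            lastRecoveringNodeCreationTime = Long.MAX_VALUE;
        }
    }

    int gcRuns() {
        return gcRuns;
    }

    public static void main(String[] args) {
        RecoveringZnodeGcSketch gc = new RecoveringZnodeGcSketch();
        gc.periodicCheck(100_000L);        // after startup: runs once
        gc.periodicCheck(101_000L);        // disabled: nothing to collect
        gc.onSplitWorkScheduled(102_000L); // new recovery work re-arms the GC
        gc.periodicCheck(103_000L);        // too soon after scheduling
        gc.periodicCheck(120_000L);        // threshold exceeded: runs again
        System.out.println("GC passes: " + gc.gcRuns());
    }
}
```

The point of using a sentinel instead of a separate boolean flag is that the 
periodic check stays a single comparison.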


                
> META region isn't fully recovered during master initialization when META 
> region recovery had chained failures
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-8666
>                 URL: https://issues.apache.org/jira/browse/HBASE-8666
>             Project: HBase
>          Issue Type: Bug
>          Components: MTTR
>            Reporter: Jeffrey Zhong
>            Assignee: Jeffrey Zhong
>             Fix For: 0.98.0, 0.95.2
>
>         Attachments: hbase-8666.patch, hbase-8666-v2.patch, 
> hbase-8666-v3.patch
>
>
> In distributedLogReplay mode, when META recovery has experienced chained 
> failures (recovery failed multiple times in a row), the META region can't be 
> fully recovered during master startup.
>  

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
