[
https://issues.apache.org/jira/browse/HBASE-8666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13673629#comment-13673629
]
Jeffrey Zhong commented on HBASE-8666:
--------------------------------------
{quote}
How can this happen? Is this okay? May be because log replaying/splitting of
the RS which was hosting meta is clear by the time master was initialized (and
therefore getFailedServersFromLogFolders() couldn't find it?)
{quote}
Possibly. It may also be because I killed the master right before it could
delete the recovering META znode from ZK, while the same server still had
non-meta WALs to be replayed, so HMaster#removeStaleRecoveringRegionsFromZK
couldn't delete the stale META region.
{quote}
Right. That's why I commented that would lastRecoveringNodeCreationTime be
reset to 0 instead of Long.MAXIMUM in removeRecoveringRegionsFromZK once all
such znodes are cleared?
{quote}
I see your point now. This is a good question. Below is the related code:
{code}
// no splitting work items left
deleteRecoveringRegionZNodes(null);
// reset lastRecoveringNodeCreationTime because we cleared all
// recovering znodes at this point.
lastRecoveringNodeCreationTime = Long.MAX_VALUE;
{code}
I set lastRecoveringNodeCreationTime=Long.MAX_VALUE because after the
deleteRecoveringRegionZNodes call all recovering regions have been removed.
There is no need to keep running the GC process every second, especially since
it adds read traffic to ZK when there is nothing to GC.
We only need the GC again once new split work is scheduled, at which point
lastRecoveringNodeCreationTime is set to the current time again.
lastRecoveringNodeCreationTime is initialized to 0 because during master
startup the previous state is gone, so we don't know whether GC is needed.
We therefore trigger it once after master initialization in case there are
orphaned items to clear. After the GC has run once and
lastRecoveringNodeCreationTime=Long.MAX_VALUE, the GC stops until new
recovery work items come in.
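To make the three states of the timestamp concrete, here is a minimal,
self-contained sketch of the gating described above. It is not the actual
SplitLogManager code: the class, the chore() method, the in-memory set standing
in for ZK, and the pendingSplitTasks counter are all hypothetical, and the
threshold check is an assumption about how the timestamp is compared.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical model of the recovering-znode GC gating; not HBase source.
class RecoveringZNodeGcSketch {
    // 0 at startup: previous state is unknown, so the first chore run GCs once.
    private long lastRecoveringNodeCreationTime = 0;
    // Stand-in for the recovering-region znodes kept in ZK.
    private final Set<String> recoveringZNodes = new HashSet<>();
    private int pendingSplitTasks = 0;
    private int gcRuns = 0;
    private final long checkThresholdMs; // assumed threshold, name is made up

    RecoveringZNodeGcSketch(long checkThresholdMs) {
        this.checkThresholdMs = checkThresholdMs;
    }

    // New split work re-arms the GC by recording the current time.
    void scheduleSplitWork(String region, long now) {
        recoveringZNodes.add(region);
        pendingSplitTasks++;
        lastRecoveringNodeCreationTime = now;
    }

    void finishSplitWork() {
        pendingSplitTasks--;
    }

    // Periodic chore. GC runs only when no split work is left and the
    // timestamp is old enough. With Long.MAX_VALUE the difference is
    // negative, so the check fails and nothing is read from "ZK".
    void chore(long now) {
        if (pendingSplitTasks == 0
                && now - lastRecoveringNodeCreationTime > checkThresholdMs) {
            recoveringZNodes.clear(); // stands in for deleteRecoveringRegionZNodes(null)
            gcRuns++;
            lastRecoveringNodeCreationTime = Long.MAX_VALUE; // disarm until new work
        }
    }

    int gcRuns() { return gcRuns; }
    int znodeCount() { return recoveringZNodes.size(); }

    public static void main(String[] args) {
        RecoveringZNodeGcSketch gc = new RecoveringZNodeGcSketch(1000);
        gc.chore(5000);                     // startup: timestamp is 0, GC runs
        gc.chore(6000);                     // disarmed: GC is skipped
        gc.scheduleSplitWork("meta", 7000); // new work re-arms the GC
        gc.finishSplitWork();
        gc.chore(9000);                     // no work left, threshold passed
        System.out.println("GC runs: " + gc.gcRuns());
    }
}
```

The point of the sketch is only that 0 means "run once because we don't know",
the current time means "work was scheduled recently", and Long.MAX_VALUE means
"nothing to collect until new work arrives".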
> META region isn't fully recovered during master initialization when META
> region recovery had chained failures
> -------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-8666
> URL: https://issues.apache.org/jira/browse/HBASE-8666
> Project: HBase
> Issue Type: Bug
> Components: MTTR
> Reporter: Jeffrey Zhong
> Assignee: Jeffrey Zhong
> Fix For: 0.98.0, 0.95.2
>
> Attachments: hbase-8666.patch, hbase-8666-v2.patch,
> hbase-8666-v3.patch
>
>
> In distributedLogReplay mode, when META recovery has experienced chained
> failures (recovery failed multiple times in a row), the META region can't be
> fully recovered during master startup.
>