[
https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843012#comment-13843012
]
Nicolas Liochon commented on HBASE-10000:
-----------------------------------------
I reviewed the 0.96 version:
I found only one serious issue for a very specific config. Everything else if
nit. Still, the code is very complex, I spent more than 1 hour reviewing these
10 lines, but I'm not sure I covered all cases...
bq. if (nbAttempt == 0 && isFileClosedMeth == null && firstPause > 0) {
if the master initiated the recovery more than 4 seconds ago AND there is not
isFileClosed on the region server or namenode, we will do the second the
recoverLease after 1 minute: our recovery time will always be > 1 minute with
this implementation. It's serious, because may be the file is actually
recovered but we don't even check.
With the previous implementation, we would have done the second attempt after 4
seconds.
Nits:
bq. if (recovered) break;
Would be in the loop above. Basically; we don't really need the 'recovered'
variable, we always return immediately when it's true.
{code}
if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth,
p)) {
recovered = true;
break;
}
{code}
could be
{code}
if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth,
p)) {
return true;
}
{code}
bq. // On the first time through wait the short 'firstPause'
The comment does not match the code (isFileClosedMeth and master recovery time
are actually taken into account). We actually do a firstPause only if
isFileClosed is not available, and then we calculate the firstPause time by
taking into account the time when the recovery was started, master included.
I wonder if we should not max the firstPause, for example if the master time is
not synchronized with the region server. Considering the criticallity of this
code it would be safer imho.
> Initiate lease recovery for outstanding WAL files at the very beginning of
> recovery
> -----------------------------------------------------------------------------------
>
> Key: HBASE-10000
> URL: https://issues.apache.org/jira/browse/HBASE-10000
> Project: HBase
> Issue Type: Improvement
> Reporter: Ted Yu
> Assignee: Ted Yu
> Fix For: 0.98.1
>
> Attachments: 10000-0.96-v5.txt, 10000-recover-ts-with-pb-2.txt,
> 10000-recover-ts-with-pb-3.txt, 10000-recover-ts-with-pb-4.txt,
> 10000-recover-ts-with-pb-5.txt, 10000-v4.txt, 10000-v5.txt, 10000-v6.txt
>
>
> At the beginning of recovery, master can send lease recovery requests
> concurrently for outstanding WAL files using a thread pool.
> Each split worker would first check whether the WAL file it processes is
> closed.
> Thanks to Nicolas Liochon and Jeffery discussion with whom gave rise to this
> idea.
--
This message was sent by Atlassian JIRA
(v6.1.4#6159)