[jira] [Commented] (HBASE-10000) Initiate lease recovery for outstanding WAL files at the very beginning of recovery

Nicolas Liochon (JIRA) Mon, 09 Dec 2013 01:39:45 -0800

    [ 
https://issues.apache.org/jira/browse/HBASE-10000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13843012#comment-13843012
 ]


Nicolas Liochon commented on HBASE-10000:
-----------------------------------------

I reviewed the 0.96 version:

I found only one serious issue for a very specific config. Everything else if 
nit. Still, the code is very complex, I spent more than 1 hour reviewing these 
10 lines, but I'm not sure I covered all cases...
bq.         if (nbAttempt == 0 && isFileClosedMeth == null && firstPause > 0) {
if the master initiated the recovery more than 4 seconds ago AND there is not 
isFileClosed on the region server or namenode, we will do the second the 
recoverLease after 1 minute: our recovery time will always be > 1 minute with 
this implementation. It's serious, because may be the file is actually 
recovered but we don't even check.
With the previous implementation, we would have done the second attempt after 4 
seconds.

Nits:

bq.       if (recovered) break;
Would be in the loop above. Basically; we don't really need the 'recovered' 
variable, we always return immediately when it's true.

{code}
            if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth, 
p)) {
              recovered = true;
              break;
            }
{code}

could be
{code}
            if (isFileClosedMeth != null && isFileClosed(dfs, isFileClosedMeth, 
p)) {
              return true;
            }
{code}

bq.  // On the first time through wait the short 'firstPause'
The comment does not match the code (isFileClosedMeth and master recovery time 
are actually taken into account). We actually do a firstPause only if 
isFileClosed is not available, and then we calculate the firstPause time by 
taking into account the time when the recovery was started, master included.

I wonder if we should not max the firstPause, for example if the master time is 
not synchronized with the region server. Considering the criticallity of this 
code it would be safer imho.




> Initiate lease recovery for outstanding WAL files at the very beginning of 
> recovery
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-10000
>                 URL: https://issues.apache.org/jira/browse/HBASE-10000
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Ted Yu
>            Assignee: Ted Yu
>             Fix For: 0.98.1
>
>         Attachments: 10000-0.96-v5.txt, 10000-recover-ts-with-pb-2.txt, 
> 10000-recover-ts-with-pb-3.txt, 10000-recover-ts-with-pb-4.txt, 
> 10000-recover-ts-with-pb-5.txt, 10000-v4.txt, 10000-v5.txt, 10000-v6.txt
>
>
> At the beginning of recovery, master can send lease recovery requests 
> concurrently for outstanding WAL files using a thread pool.
> Each split worker would first check whether the WAL file it processes is 
> closed.
> Thanks to Nicolas Liochon and Jeffery discussion with whom gave rise to this 
> idea. 



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)

[jira] [Commented] (HBASE-10000) Initiate lease recovery for outstanding WAL files at the very beginning of recovery

Reply via email to