[ 
https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561859#action_12561859
 ] 

Jim Kellerman commented on HADOOP-2660:
---------------------------------------

> I think there are two options that would help solve this problem, and we
> might need to use both.

> option 1
> Build in a backlog limit on how many pending opens any one region server
> can have before it stops accepting new opens.
> For example, finding the maximum sequence id for a region takes a lot less
> time than doing a recovery on a region. So its queue would fill up faster,
> making the master send some open requests to different servers while this
> one catches up, or loop until one of the region servers has open slots in
> its pending open queue. I think 60 secs is the default loop time, so they
> should be able to handle 10 pending opens or something like that. Maybe
> make the limit an option in the conf.
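
To make the backlog idea in option 1 concrete, here is a minimal sketch of a
per-server cap on queued opens. This is not existing HBase code: the class
name PendingOpenBacklog, its methods, and the limit of 2 used in main are all
hypothetical and only illustrate the proposal as described above.

```java
import java.util.ArrayDeque;
import java.util.Queue;

/** Hypothetical sketch of option 1: a bounded backlog of pending region opens. */
class PendingOpenBacklog {
    private final int limit; // assumed to come from a (hypothetical) conf option
    private final Queue<String> pending = new ArrayDeque<>();

    PendingOpenBacklog(int limit) {
        this.limit = limit;
    }

    /** Returns false when the backlog is full so the caller can try another server. */
    synchronized boolean offer(String regionName) {
        if (pending.size() >= limit) {
            return false;
        }
        pending.add(regionName);
        return true;
    }

    /** Removes and returns the next region to open, or null if none is queued. */
    synchronized String poll() {
        return pending.poll();
    }

    public static void main(String[] args) {
        PendingOpenBacklog backlog = new PendingOpenBacklog(2);
        System.out.println(backlog.offer("region-a")); // true
        System.out.println(backlog.offer("region-b")); // true
        System.out.println(backlog.offer("region-c")); // false: backlog is full
    }
}
```

When offer returns false, the master in this sketch would pick a different
region server or retry on its next loop.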

> option 2
>
> 1. Confirm that we received the master's open request once we receive it.
>
> Once confirmed, the master should not reassign the region to any other
> region server unless the region server goes offline and loses its lease.

In fact, this is exactly what happens today. When a region server receives an
open region request, it replies in its next heartbeat to the master with
MSG_REPORT_PROCESS_OPEN, which means "I got your request and am working on
it." When the master receives this message, it adds
hbase.hbasemaster.maxregionopen (currently 30 seconds) to the amount of time
before it will try to assign the region again. If it is taking longer than
30 seconds for a region server to open a region, I would suggest increasing
the value of this parameter to 60000 (60 seconds).
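
The timing rule can be sketched roughly as follows. This is a simplified
model, not the master's actual code: the class OpenDeadlineTracker and its
method names are hypothetical, and it assumes the initial reassignment
deadline is also one maxregionopen interval after the assignment.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of how the master delays reassigning a region. */
class OpenDeadlineTracker {
    private final long maxRegionOpenMs; // stands in for hbase.hbasemaster.maxregionopen
    private final Map<String, Long> deadlines = new HashMap<>();

    OpenDeadlineTracker(long maxRegionOpenMs) {
        this.maxRegionOpenMs = maxRegionOpenMs;
    }

    /** Master sent an open request (assumption: first deadline is now + maxregionopen). */
    void assigned(String region, long nowMs) {
        deadlines.put(region, nowMs + maxRegionOpenMs);
    }

    /** MSG_REPORT_PROCESS_OPEN arrived: push the deadline out by maxregionopen. */
    void processOpenReported(String region) {
        deadlines.computeIfPresent(region, (r, d) -> d + maxRegionOpenMs);
    }

    /** MSG_REPORT_OPEN arrived: the region is served, so stop tracking it. */
    void opened(String region) {
        deadlines.remove(region);
    }

    /** True once the deadline has passed without the region being opened. */
    boolean shouldReassign(String region, long nowMs) {
        Long deadline = deadlines.get(region);
        return deadline != null && nowMs >= deadline;
    }

    public static void main(String[] args) {
        OpenDeadlineTracker tracker = new OpenDeadlineTracker(30_000L);
        tracker.assigned("region-a", 0L);
        tracker.processOpenReported("region-a");
        System.out.println(tracker.shouldReassign("region-a", 45_000L)); // false
        System.out.println(tracker.shouldReassign("region-a", 60_000L)); // true
    }
}
```

In this model a larger maxregionopen widens both the initial window and each
extension, which is why raising it helps when opens are slow.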

> 2. Confirm whether the open of the region succeeded or failed

When the region server has opened the region, it sends a MSG_REPORT_OPEN to
the master, meaning that it is now serving the region.

> The master can make sure the region server is still alive by keeping up with 
> heartbeat

It is the region server that sends the heartbeat to the master, but this is 
exactly what happens.
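
As a rough illustration of heartbeat-based liveness, the sketch below has the
master record the time of each region server's last heartbeat and treat the
server's lease as expired once no heartbeat has arrived within the lease
period. The class LeaseMonitor, its methods, and the lease period used in
main are hypothetical and are not taken from the HBase source.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: the master tracks region server heartbeats. */
class LeaseMonitor {
    private final long leasePeriodMs; // assumed lease length, not a real default
    private final Map<String, Long> lastHeartbeat = new HashMap<>();

    LeaseMonitor(long leasePeriodMs) {
        this.leasePeriodMs = leasePeriodMs;
    }

    /** Called when a heartbeat from the given region server reaches the master. */
    void heartbeat(String server, long nowMs) {
        lastHeartbeat.put(server, nowMs);
    }

    /** A server is alive if its last heartbeat is within the lease period. */
    boolean isAlive(String server, long nowMs) {
        Long last = lastHeartbeat.get(server);
        return last != null && nowMs - last < leasePeriodMs;
    }

    public static void main(String[] args) {
        LeaseMonitor monitor = new LeaseMonitor(10_000L);
        monitor.heartbeat("server-1", 0L);
        System.out.println(monitor.isAlive("server-1", 5_000L));  // true
        System.out.println(monitor.isAlive("server-1", 10_000L)); // false: lease expired
    }
}
```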



> Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-2660
>                 URL: https://issues.apache.org/jira/browse/HADOOP-2660
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: contrib/hbase
>            Reporter: Billy Pearson
>
> I think we addressed this in HADOOP-2295, but I have found it showing up
> again. My hlog size is set to 250,000, so on recovery from a failed region
> server, scanning the logs takes longer than the
> hbase.hbasemaster.maxregionopen default of 30 secs. The master thinks the
> region is open, but the region server closes the region when it is done
> recovering because the master sent a MSG_REGION_CLOSE_WITHOUT_REPORT to the
> region server.
> I was able to get my table back online completely by adding
> hbase.hbasemaster.maxregionopen with a value of 300000 milliseconds to my
> hbase-site.xml file and restarting.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
