[
https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561879#action_12561879
]
Billy Pearson commented on HADOOP-2660:
---------------------------------------
Currently I have to set hbase.hbasemaster.maxregionopen = 600000 (10 mins)
when I have lots of regions, about 150-175 over 4 nodes.
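For reference, the override looks something like this in my hbase-site.xml
(the property name and value are as above; the description text is just my
own wording of what I understand the setting to do):

<property>
  <name>hbase.hbasemaster.maxregionopen</name>
  <value>600000</value>
  <description>How long the master waits for a region to be reported open
    before it reassigns the region elsewhere, in milliseconds. Raised from
    the 30 second default so hlog recovery has time to finish.</description>
</property>
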
I do not have a problem with the loading time until something happens and we
have to rebuild from the logs on a restart of the cluster.
What happens is that one server gets overloaded with more open requests than
the others, and it takes quite a long time to load the regions, scanning the
hlogs to rebuild the memcache.
Then open re-requests get sent to other servers for the same regions, and
those servers start scanning the hlog to rebuild the memcache as well.
At some points I see more regions open in the master GUI than there should
be; then the servers get MSG_REGION_CLOSE_WITHOUT_REPORT from the master and
close out some regions because they were reassigned again by the master.
Sometimes they do not all open correctly and we have to restart and go
through all of this again, or more than one copy of a region stays open:
only one copy gets the updates, but the others still have the region loaded.
Lately, if I do not know what the count of open regions should be, I run a
query with the shell that selects a column I know is in every region, so it
scans the whole table, and I wait to see whether I get an error, to verify
that all regions are open.
Maybe we should add an open queue to limit the number of pending opens that a
region server can have at any point in time. I would also suggest having the
region server send messages back to the master with its heartbeat, listing
some of the regions it has open and loaded; that way we can find regions that
did not get opened for some reason but that the master thinks are open. Maybe
send 5 regions per heartbeat, so over time we can make sure that every region
is still alive on the region server. With the current setup we could have some
missed messages between the master and the region server, and the master can
be confused about which regions are open and which are not. Alternatively, we
could have the clients report to the master when a region server returns an
error on a read or write to a region that the region server says it is not
serving but the master thinks it is. Then the master can issue a close command
to the region server and reassign the region.
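To make the heartbeat idea a little more concrete, here is a rough,
self-contained sketch of what I mean. Every class and method name below is
made up for illustration; this is not the actual master or region server
code, just the shape of the bookkeeping I am suggesting.

import java.util.*;

// Sketch of the suggestion above: the region server piggybacks a handful of
// the regions it actually has open onto each heartbeat, and the master
// checks those names against what it believes is assigned to that server.
// Every name here is hypothetical; this is not HBase code.
public class HeartbeatRegionReport {

    static final int REGIONS_PER_HEARTBEAT = 5; // "maybe send 5 regions per heartbeat"

    // Stand-in for a region server: keeps a rotating cursor over its open
    // regions so that, across several heartbeats, every region it holds
    // eventually gets reported.
    static class FakeRegionServer {
        private final List<String> openRegions;
        private int cursor = 0;

        FakeRegionServer(Collection<String> open) {
            this.openRegions = new ArrayList<String>(open);
        }

        List<String> nextHeartbeatReport() {
            List<String> report = new ArrayList<String>();
            for (int i = 0; i < REGIONS_PER_HEARTBEAT && i < openRegions.size(); i++) {
                report.add(openRegions.get(cursor));
                cursor = (cursor + 1) % openRegions.size();
            }
            return report;
        }
    }

    // Stand-in for the master: remembers which regions it thinks this server
    // has open and flags anything in the report it did not assign -- the
    // "master thinks one thing, region server thinks another" case.
    static class FakeMaster {
        private final Set<String> assumedOpen;

        FakeMaster(Collection<String> assumed) {
            this.assumedOpen = new HashSet<String>(assumed);
        }

        void reconcile(List<String> reportedOpen) {
            for (String region : reportedOpen) {
                if (!assumedOpen.contains(region)) {
                    // This is where the master would send a close and
                    // reassign; the sketch just prints the mismatch.
                    System.out.println("mismatch: server reports " + region
                        + " open, but the master did not assign it");
                }
            }
        }
    }

    public static void main(String[] args) {
        // The server actually opened regionC even though the master only
        // knows about regionA and regionB -- the kind of drift described
        // above.
        FakeRegionServer server = new FakeRegionServer(
            Arrays.asList("regionA", "regionB", "regionC"));
        FakeMaster master = new FakeMaster(
            Arrays.asList("regionA", "regionB"));

        // A few heartbeat cycles; regionC is flagged as soon as it shows up
        // in a report.
        for (int beat = 0; beat < 3; beat++) {
            master.reconcile(server.nextHeartbeatReport());
        }
    }
}

With something like this, a missed open or a double assignment would surface
within a few heartbeats instead of only when a client happens to hit the bad
region.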
These are just some suggestions to consider, as I have seen problems in this
area where the master thinks one thing and the region server thinks something
else.
> Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT
> -----------------------------------------------------------------------
>
> Key: HADOOP-2660
> URL: https://issues.apache.org/jira/browse/HADOOP-2660
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hbase
> Reporter: Billy Pearson
>
> I think we addressed this in
> HADOOP-2295
> but I have found it showing up again
> my hlog size is set to 250,000
> so on a recovery from a failed region server, the scanning of the logs takes
> longer than the hbase.hbasemaster.maxregionopen default of 30 secs,
> and the master thinks the region is open, but the region server closes the
> region when done recovering because the master sent a
> MSG_REGION_CLOSE_WITHOUT_REPORT to the region server.
> I was able to get my table back online completely by adding
> hbase.hbasemaster.maxregionopen with a value of 300000 milliseconds to my
> hbase-site.xml file and restarting.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.