[
https://issues.apache.org/jira/browse/HADOOP-2660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12561879#action_12561879
]
Billy Pearson commented on HADOOP-2660:
---------------------------------------
Currently I have to set hbase.hbasemaster.maxregionopen = 600000 (10 mins)
when I have lots of regions, about 150-175 over 4 nodes.
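For reference, the override looks something like this in my hbase-site.xml
(the property name and value are as above; the description text is just my
own wording of what I understand the setting to do):

<property>
  <name>hbase.hbasemaster.maxregionopen</name>
  <value>600000</value>
  <description>How long the master waits for a region to be reported open
    before it reassigns the region elsewhere, in milliseconds. Raised from
    the 30 second default so hlog recovery has time to finish.</description>
</property>
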
I do not have a problem with the loading time until something happens and we
have to rebuild from the logs on a restart of the cluster.
What happens is that one server gets overloaded with more open requests than
the others, and it takes quite a long time to load the regions, scanning the
hlogs to rebuild the memcache.
Then open re-requests get sent to other servers for the same regions, and
those servers start scanning the hlog to rebuild the memcache as well.
At some points I see more regions open in the master GUI than there should
be; then the servers get MSG_REGION_CLOSE_WITHOUT_REPORT from the master and
close out some regions because they were reassigned again by the master.
Sometimes they do not all open correctly and we have to restart and go
through all of this again, or more than one copy of a region stays open:
only one copy gets the updates, but the others still have the region loaded.
Lately, if I do not know what the count of open regions should be, I run a
query with the shell that selects a column I know is in every region, so it
scans the whole table, and I wait to see whether I get an error, to verify
that all regions are open.
Maybe we should add an open queue to limit the number of pending opens that a
region server can have at any point in time. I would also suggest having the
region server send messages back to the master with its heartbeat, listing
some of the regions it has open and loaded; that way we can find regions that
did not get opened for some reason but that the master thinks are open. Maybe
send 5 regions per heartbeat, so over time we can make sure that every region
is still alive on the region server. With the current setup we could have some
missed messages between the master and the region server, and the master can
be confused about which regions are open and which are not. Alternatively, we
could have the clients report to the master when a region server returns an
error on a read or write to a region that the region server says it is not
serving but the master thinks it is. Then the master can issue a close command
to the region server and reassign the region.
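To make the heartbeat idea a little more concrete, here is a rough,
self-contained sketch of what I mean. Every class and method name below is
made up for illustration; this is not the actual master or region server
code, just the shape of the bookkeeping I am suggesting.

import java.util.*;

// Sketch of the suggestion above: the region server piggybacks a handful of
// the regions it actually has open onto each heartbeat, and the master
// checks those names against what it believes is assigned to that server.
// Every name here is hypothetical; this is not HBase code.
public class HeartbeatRegionReport {

    static final int REGIONS_PER_HEARTBEAT = 5; // "maybe send 5 regions per heartbeat"

    // Stand-in for a region server: keeps a rotating cursor over its open
    // regions so that, across several heartbeats, every region it holds
    // eventually gets reported.
    static class FakeRegionServer {
        private final List<String> openRegions;
        private int cursor = 0;

        FakeRegionServer(Collection<String> open) {
            this.openRegions = new ArrayList<String>(open);
        }

        List<String> nextHeartbeatReport() {
            List<String> report = new ArrayList<String>();
            for (int i = 0; i < REGIONS_PER_HEARTBEAT && i < openRegions.size(); i++) {
                report.add(openRegions.get(cursor));
                cursor = (cursor + 1) % openRegions.size();
            }
            return report;
        }
    }

    // Stand-in for the master: remembers which regions it thinks this server
    // has open and flags anything in the report it did not assign -- the
    // "master thinks one thing, region server thinks another" case.
    static class FakeMaster {
        private final Set<String> assumedOpen;

        FakeMaster(Collection<String> assumed) {
            this.assumedOpen = new HashSet<String>(assumed);
        }

        void reconcile(List<String> reportedOpen) {
            for (String region : reportedOpen) {
                if (!assumedOpen.contains(region)) {
                    // This is where the master would send a close and
                    // reassign; the sketch just prints the mismatch.
                    System.out.println("mismatch: server reports " + region
                        + " open, but the master did not assign it");
                }
            }
        }
    }

    public static void main(String[] args) {
        // The server actually opened regionC even though the master only
        // knows about regionA and regionB -- the kind of drift described
        // above.
        FakeRegionServer server = new FakeRegionServer(
            Arrays.asList("regionA", "regionB", "regionC"));
        FakeMaster master = new FakeMaster(
            Arrays.asList("regionA", "regionB"));

        // A few heartbeat cycles; regionC is flagged as soon as it shows up
        // in a report.
        for (int beat = 0; beat < 3; beat++) {
            master.reconcile(server.nextHeartbeatReport());
        }
    }
}

With something like this, a missed open or a double assignment would surface
within a few heartbeats instead of only when a client happens to hit the bad
region.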
These are just some suggestions to consider, as I have seen problems in this
area where the master thinks one thing and the region server thinks something
else.
> Regions getting messages from master to MSG_REGION_CLOSE_WITHOUT_REPORT
> -----------------------------------------------------------------------
>
> Key: HADOOP-2660
> URL: https://issues.apache.org/jira/browse/HADOOP-2660
> Project: Hadoop Core
> Issue Type: Bug
> Components: contrib/hbase
> Reporter: Billy Pearson
>
> I think we addressed this in
> HADOOP-2295
> but I have found it showing up again
> my hlog size is set to 250,000
> so on a recovery from a failed region server, the scanning of the logs takes
> longer than the hbase.hbasemaster.maxregionopen default of 30 secs,
> and the master thinks the region is open, but the region server closes the
> region when done recovering because the master sent a
> MSG_REGION_CLOSE_WITHOUT_REPORT to the region server.
> I was able to get my table back online completely by adding
> hbase.hbasemaster.maxregionopen with a value of 300000 milliseconds to my
> hbase-site.xml file and restarting.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.