[
https://issues.apache.org/jira/browse/HBASE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582454#action_12582454
]
stack commented on HBASE-505:
-----------------------------
Just came across an instance of this. Here is where we are doing the open:
2008-03-25 19:16:43,720 INFO org.apache.hadoop.hbase.HRegionServer:
MSG_REGION_OPEN : enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
Here is where we failed the open because the region had been allocated
elsewhere, even though this is the server that replayed the edits:
2008-03-25 19:19:56,472 ERROR org.apache.hadoop.hbase.HRegionServer: error
opening region enwiki_080103,iLStZ0yTnfVUziYcNVVxWV==,1205393076482
It took about 3 minutes to replay 60K edits.
The RegionServer should ping the master after every 10K edits it applies.
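
A minimal sketch of the ping suggested above, in Java. This is not HBase code:
ReplayProgressSketch, ProgressReporter and REPORT_INTERVAL are hypothetical
names, and the callback only stands in for whatever message the region server
would actually send to the master. It shows one progress report for every 10K
edits applied during replay.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from HBase.
final class ReplayProgressSketch {

    /** Hypothetical callback standing in for the message sent to the master. */
    interface ProgressReporter {
        void progress(long editsApplied);
    }

    /** Assumed interval, taken from the "every 10K edits" suggestion. */
    static final int REPORT_INTERVAL = 10_000;

    /** Replays totalEdits edits and returns the number of pings sent. */
    static int replay(long totalEdits, ProgressReporter reporter) {
        int pings = 0;
        for (long applied = 1; applied <= totalEdits; applied++) {
            // Applying the edit itself is elided in this sketch.
            if (applied % REPORT_INTERVAL == 0) {
                reporter.progress(applied);
                pings++;
            }
        }
        return pings;
    }

    public static void main(String[] args) {
        List<Long> seen = new ArrayList<>();
        int pings = replay(60_000, n -> seen.add(n));
        // prints "6 pings sent, last at edit 60000"
        System.out.println(pings + " pings sent, last at edit "
                + seen.get(seen.size() - 1));
    }
}
```

With the 60K edits seen here, that would be six pings spread over the three
minutes of replay.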
> Region assignments should never time out so long as the region server reports
> that it is processing the open request
> --------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-505
> URL: https://issues.apache.org/jira/browse/HBASE-505
> Project: Hadoop HBase
> Issue Type: Bug
> Reporter: Jim Kellerman
> Assignee: Jim Kellerman
>
> Currently, when the master assigns a region to a region server, it extends
> the reassignment timeout when the region server reports that it is processing
> the open. This happens only once, so if the region takes a long time to
> come online due to a large set of transactions in the redo log or because
> the initial compaction takes a long time, the master will assign the region
> to another server when the reassignment timeout occurs.
> Assigning a region to multiple region servers can easily corrupt the region.
> For example:
> Region server 1 is processing the redo log, creating a new mapFile. It takes
> more than one interval to do so, so the master assigns the region to region
> server 2. Region server 2 starts processing the redo log, creating essentially
> the same mapFile as region server 1, but with a different name.
> Region server 2 can fail to open the region if region server 1 deletes the
> old log file or if it tries to open the new mapFile that region server 1 is
> creating.
> Region server 1 can fail to open the region if it tries to open the mapFile
> that region server 2 is creating.
> Often region server 1 eventually succeeds and reports to the master that it
> has finished opening the region, but the master tells it to close that region
> because it has assigned it to another server. Region server 2 often fails to
> open the region, because the old log file has been deleted, or it fails to
> process the new mapFile created by region server 1.
> Proposed solution:
> During the open process the region server should send a MSG_PROCESS_OPEN with
> each heartbeat until the region is opened (when it sends MSG_REGION_OPEN).
> The master will extend the reassignment timeout with each MSG_PROCESS_OPEN it
> receives and will not assign the region to another server so long as it
> continues to receive heartbeat messages from the region server processing
> the open.
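
The master side of the proposed solution could look roughly like the Java
sketch below. It is an illustration under stated assumptions, not the actual
HBase implementation: AssignmentTimeoutSketch and its methods are hypothetical,
times are plain millisecond values passed in by the caller, and only the
message names MSG_PROCESS_OPEN and MSG_REGION_OPEN come from the issue. Each
MSG_PROCESS_OPEN from the current assignee pushes the reassignment deadline
out again, so the region cannot be handed to a second server while the first
one is still reporting.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the master's bookkeeping; not HBase code.
final class AssignmentTimeoutSketch {
    private final long timeoutMillis;
    // Region name -> server the region is currently assigned to.
    private final Map<String, String> owners = new HashMap<>();
    // Region name -> time after which the region may be reassigned.
    private final Map<String, Long> deadlines = new HashMap<>();

    AssignmentTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** The master assigns a region and starts the reassignment timeout. */
    void assign(String region, String server, long now) {
        owners.put(region, server);
        deadlines.put(region, now + timeoutMillis);
    }

    /**
     * Handles MSG_PROCESS_OPEN. Extends the deadline on every message,
     * but only if it comes from the server that holds the assignment.
     */
    boolean processOpen(String region, String server, long now) {
        if (!server.equals(owners.get(region))) {
            return false;
        }
        deadlines.put(region, now + timeoutMillis);
        return true;
    }

    /** Handles MSG_REGION_OPEN: the open finished, so no timeout applies. */
    void regionOpened(String region) {
        deadlines.remove(region);
    }

    /** True only if an open is pending and its deadline has passed. */
    boolean mayReassign(String region, long now) {
        Long deadline = deadlines.get(region);
        return deadline != null && now >= deadline;
    }

    public static void main(String[] args) {
        // Assumed values: 30 s timeout, one heartbeat every 10 s.
        AssignmentTimeoutSketch master = new AssignmentTimeoutSketch(30_000);
        master.assign("region-a", "rs1", 0);
        for (long t = 10_000; t <= 190_000; t += 10_000) {
            master.processOpen("region-a", "rs1", t);
        }
        // Still false after roughly 3 minutes, because heartbeats kept arriving.
        System.out.println(master.mayReassign("region-a", 193_000));
    }
}
```

Checking the sender in processOpen is a design choice of this sketch: a stale
message from a server that no longer holds the assignment should not keep the
timeout alive.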