[ https://issues.apache.org/jira/browse/HBASE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jim Kellerman updated HBASE-505:
--------------------------------
Status: Patch Available (was: Open)
Patch available for 0.1.1
> Region assignments should never time out so long as the region server reports
> that it is processing the open request
> --------------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-505
> URL: https://issues.apache.org/jira/browse/HBASE-505
> Project: Hadoop HBase
> Issue Type: Bug
> Affects Versions: 0.1.0, 0.2.0
> Reporter: Jim Kellerman
> Assignee: stack
> Priority: Blocker
> Fix For: 0.2.0, 0.1.1
>
> Attachments: 505.patch
>
>
> Currently, when the master assigns a region to a region server, it extends
> the reassignment timeout when the region server reports that it is processing
> the open. This happens only once, so if the region takes a long time to
> come online, either because of a large set of transactions in the redo log
> or because the initial compaction takes a long time, the master will assign
> the region to another server when the reassignment timeout expires.
> Assigning a region to multiple region servers can easily corrupt the region.
> For example:
> Region server 1 is processing the redo log, creating a new MapFile. This takes
> more than one interval, so the master assigns the region to region
> server 2. Region server 2 starts processing the redo log, creating essentially
> the same MapFile as region server 1, but with a different name.
> Region server 2 can fail to open the region if region server 1 deletes the
> old log file or if it tries to open the new MapFile that region server 1 is
> creating.
> Region server 1 can fail to open the region if it tries to open the MapFile
> that region server 2 is creating.
> Often region server 1 eventually succeeds and reports to the master that it
> has finished opening the region, but the master tells it to close that region
> because it has assigned it to another server. Region server 2 often fails to
> open the region, either because the old log file has been deleted or because
> it cannot process the new MapFile created by region server 1.
> Proposed solution:
> While it is opening a region, the region server should send MSG_PROCESS_OPEN
> with each heartbeat until the region is open, at which point it sends
> MSG_REGION_OPEN. The master will extend the reassignment timeout each time it
> receives MSG_PROCESS_OPEN and will not assign the region to another server as
> long as it continues to receive heartbeat messages from the region server
> processing the open.
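
The proposed master-side behavior can be sketched roughly as follows. This is only an illustration, not the code in 505.patch: the class AssignmentTracker, its method names, and the explicit nowMillis parameter are all hypothetical, and only the message names MSG_PROCESS_OPEN and MSG_REGION_OPEN come from the description above. Time is passed in explicitly so the behavior can be checked without sleeping.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical master-side bookkeeping for regions that are being opened. */
class AssignmentTracker {
    /** One pending assignment: the server that owns it and when it expires. */
    private static final class Pending {
        final String server;
        long deadlineMillis;

        Pending(String server, long deadlineMillis) {
            this.server = server;
            this.deadlineMillis = deadlineMillis;
        }
    }

    private final long timeoutMillis;
    private final Map<String, Pending> pending = new HashMap<>();

    AssignmentTracker(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** The master assigns a region; the reassignment clock starts now. */
    void assign(String region, String server, long nowMillis) {
        pending.put(region, new Pending(server, nowMillis + timeoutMillis));
    }

    /**
     * A heartbeat carried MSG_PROCESS_OPEN. The deadline is extended on
     * every such report (not just the first), but only for the server
     * that currently owns the assignment.
     */
    boolean onProcessOpen(String region, String server, long nowMillis) {
        Pending p = pending.get(region);
        if (p == null || !p.server.equals(server)) {
            return false;
        }
        p.deadlineMillis = nowMillis + timeoutMillis;
        return true;
    }

    /** A heartbeat carried MSG_REGION_OPEN: the region is online, stop tracking it. */
    boolean onRegionOpen(String region, String server) {
        Pending p = pending.get(region);
        if (p == null || !p.server.equals(server)) {
            return false;
        }
        pending.remove(region);
        return true;
    }

    /** True only if the region is still pending and its deadline has passed. */
    boolean mayReassign(String region, long nowMillis) {
        Pending p = pending.get(region);
        return p != null && nowMillis >= p.deadlineMillis;
    }
}
```

With a timeout of 1000 ms, a server that reports MSG_PROCESS_OPEN at 900 ms and again at 1800 ms keeps the region until 2800 ms. The region becomes eligible for reassignment only if those reports stop arriving.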
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.