Region assignments should never time out so long as the region server reports
that it is processing the open request
--------------------------------------------------------------------------------------------------------------------
Key: HBASE-505
URL: https://issues.apache.org/jira/browse/HBASE-505
Project: Hadoop HBase
Issue Type: Bug
Affects Versions: 0.1.0
Reporter: Jim Kellerman
Assignee: Jim Kellerman
Priority: Blocker
Fix For: 0.1.0
Currently, when the master assigns a region to a region server, it extends the
reassignment timeout when the region server reports that it is processing the
open. This extension happens only once, so if the region takes a long time to
come online, either because the redo log holds a large set of transactions or
because the initial compaction is slow, the master assigns the region to
another server when the reassignment timeout expires.
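To make the failure mode concrete, here is a minimal sketch of the one-shot
extension described above. It is not the actual master code: the class name
OneShotTimeout, its methods, and the millisecond clock passed in by the caller
are all assumptions made for illustration.

```java
// Hypothetical model of the current behaviour: the reassignment deadline is
// extended by the first "processing open" report only.
final class OneShotTimeout {
    private final long timeoutMs;
    private long deadlineMs;
    private boolean extended = false;

    OneShotTimeout(long assignedAtMs, long timeoutMs) {
        this.timeoutMs = timeoutMs;
        this.deadlineMs = assignedAtMs + timeoutMs;
    }

    /** Called when the region server reports that it is processing the open. */
    void onProcessOpen(long nowMs) {
        if (!extended) {
            // Only the first report moves the deadline; later ones are ignored.
            deadlineMs = nowMs + timeoutMs;
            extended = true;
        }
    }

    /** True once the master would hand the region to another server. */
    boolean shouldReassign(long nowMs) {
        return nowMs >= deadlineMs;
    }
}
```

With a 30 second timeout and a first report at 10 seconds, the deadline moves
to 40 seconds and stays there no matter how many further reports arrive, so a
region that needs longer than that to open is assigned a second time.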
Assigning a region to multiple region servers can easily corrupt the region.
For example:
Region server 1 is processing the redo log, creating a new mapFile. It takes
more than one interval to do so, so the master assigns the region to region
server 2. Region server 2 starts processing the redo log, creating essentially
the same mapFile as region server 1, but with a different name.
Region server 2 can fail to open the region if region server 1 deletes the old
log file or if it tries to open the new mapFile that region server 1 is
creating.
Region server 1 can fail to open the region if it tries to open the mapFile
that region server 2 is creating.
Often region server 1 eventually succeeds and reports to the master that it has
finished opening the region, but the master tells it to close that region
because it has assigned it to another server. Region server 2 often fails to
open the region, because the old log file has been deleted, or it fails to
process the new mapFile created by region server 1.
Proposed solution:
During the open process, the region server should send a MSG_PROCESS_OPEN with
each heartbeat until the region is opened, at which point it sends
MSG_REGION_OPEN. The master will extend the reassignment timeout with each
MSG_PROCESS_OPEN it receives and will not assign the region to another server
so long as it continues to receive heartbeat messages from the region server
processing the open.
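The master side of this proposal could look roughly like the sketch below. It
is only an illustration under stated assumptions: the message names
MSG_PROCESS_OPEN and MSG_REGION_OPEN come from this issue, but modelling them
as an enum, the PendingOpenTracker class, its method names, and keying regions
by a plain string are all hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical tracker for regions the master has assigned but that are not
// yet open. Every MSG_PROCESS_OPEN pushes the reassignment deadline forward.
final class PendingOpenTracker {
    // Names taken from the issue text; the enum itself is an assumption.
    enum Msg { MSG_PROCESS_OPEN, MSG_REGION_OPEN }

    private final long timeoutMs;
    private final Map<String, Long> deadlines = new HashMap<String, Long>();

    PendingOpenTracker(long timeoutMs) {
        this.timeoutMs = timeoutMs;
    }

    /** Record a new assignment and start its reassignment timeout. */
    void assign(String region, long nowMs) {
        deadlines.put(region, nowMs + timeoutMs);
    }

    /** Handle one message carried on a region server heartbeat. */
    void onHeartbeat(String region, Msg msg, long nowMs) {
        if (!deadlines.containsKey(region)) {
            return; // not a region we are waiting on
        }
        switch (msg) {
            case MSG_PROCESS_OPEN:
                // Still opening: extend the timeout on every report.
                deadlines.put(region, nowMs + timeoutMs);
                break;
            case MSG_REGION_OPEN:
                // Open finished: stop tracking the region.
                deadlines.remove(region);
                break;
        }
    }

    /** True only if the server has gone quiet for a full timeout. */
    boolean shouldReassign(String region, long nowMs) {
        Long deadline = deadlines.get(region);
        return deadline != null && nowMs >= deadline;
    }
}
```

In this sketch a region is reassigned only when a full timeout passes without
a MSG_PROCESS_OPEN, which is the case where the region server has most likely
stopped heartbeating, rather than whenever the open is merely slow.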