[ 
https://issues.apache.org/jira/browse/HBASE-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

stack updated HBASE-505:
------------------------

    Attachment: 505.patch

Here's a first cut.

HBASE-505 Region assignments should never time out so long as the region
server reports that it is processing the open request

Have the HRegionServer pass a Progressable implementation down into
Region and then down into Store, where edits are replayed.  Call progress
after every couple of thousand edits.

M  src/java/org/apache/hadoop/hbase/HStore.java
    Take a Progressable in the constructor.  Call it when applying
    edits.
M  src/java/org/apache/hadoop/hbase/HMaster.java
    Update comment around MSG_REPORT_PROCESS_OPEN so it's expected
    that we can get more than one of these messages during a region
    open.
M  src/java/org/apache/hadoop/hbase/HRegion.java
    New constructor that takes a Progressable.  Pass it to Stores on
    construction.
M  src/java/org/apache/hadoop/hbase/HRegionServer.java
    On open of a region, pass in a Progressable that adds a
    MSG_REPORT_PROCESS_OPEN every time it is called.
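
A rough sketch of the idea, not the code in 505.patch.  The Progressable
interface here is a local stand-in for Hadoop's
org.apache.hadoop.util.Progressable so the example runs on its own.  The
class name, replayEdits, REPORT_INTERVAL and the string message queue are
all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

public class ReplayProgressSketch {
    /** Local stand-in for Hadoop's org.apache.hadoop.util.Progressable. */
    public interface Progressable {
        void progress();
    }

    /** Hypothetical interval: report once per this many replayed edits. */
    public static final int REPORT_INTERVAL = 2000;

    /**
     * Pretends to replay editCount edits, calling the reporter after every
     * REPORT_INTERVAL edits. Returns how many times progress was reported.
     */
    public static int replayEdits(int editCount, Progressable reporter) {
        int calls = 0;
        for (int i = 1; i <= editCount; i++) {
            // Applying edit number i to the store would happen here.
            if (reporter != null && i % REPORT_INTERVAL == 0) {
                reporter.progress();
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        // Assumption: the region server keeps a list of messages to send
        // to the master with its next heartbeat.
        List<String> outboundMsgs = new ArrayList<>();
        Progressable reporter = () -> outboundMsgs.add("MSG_REPORT_PROCESS_OPEN");

        replayEdits(5000, reporter);
        System.out.println(outboundMsgs.size() + " progress messages queued");
    }
}
```

Passing the reporter in from the region server means the store needs to
know nothing about master messages; it only calls progress.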

> Region assignments should never time out so long as the region server reports 
> that it is processing the open request
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-505
>                 URL: https://issues.apache.org/jira/browse/HBASE-505
>             Project: Hadoop HBase
>          Issue Type: Bug
>    Affects Versions: 0.2.0
>            Reporter: Jim Kellerman
>            Assignee: Jim Kellerman
>             Fix For: 0.1.1
>
>         Attachments: 505.patch
>
>
> Currently, when the master assigns a region to a region server, it extends 
> the reassignment timeout when the region server reports that it is processing 
> the open. This only happens once, and so if the region takes a long time to 
> come on line due to a large set of transactions in the redo log or because 
> the initial compaction takes a long time, the master will assign the region 
> to another server when the reassignment timeout occurs.
> Assigning a region to multiple region servers can easily corrupt the region. 
> For example:
> Region server 1 is processing the redo log, creating a new mapFile. It takes 
> more than one interval to do so, so the master assigns the region to region 
> server 2. Region server 2 starts processing the redo log, creating essentially 
> the same mapFile as region server 1, but with a different name. 
> Region server 2 can fail to open the region if region server 1 deletes the 
> old log file or if it tries to open the new mapFile that region server 1 is 
> creating.
> Region server 1 can fail to open the region if it tries to open the mapFile 
> that region server 2 is creating.
> Often region server 1 eventually succeeds and reports to the master that it 
> has finished opening the region, but the master tells it to close that region 
> because it has assigned it to another server. Region server 2 often fails to 
> open the region, because the old log file has been deleted, or it fails to 
> process the new map file created by region server 1.
> Proposed solution:
> During the open process the region server should send a MSG_PROCESS_OPEN with 
> each heartbeat until the region is opened (when it sends MSG_REGION_OPEN). 
> The master will extend the reassignment timeout with each MSG_PROCESS_OPEN it 
> receives and will not assign the region to another server so long as it 
> continues to receive heartbeat messages from the region server processing 
> the open.
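
The timeout extension proposed above could look roughly like the
following on the master side.  This is only a sketch under assumptions: it
is not HMaster code, and the class, its methods and the deadline map are
hypothetical names.  Time is passed in explicitly to keep the example
deterministic.

```java
import java.util.HashMap;
import java.util.Map;

public class AssignmentTimeoutSketch {
    private final long timeoutMillis;
    /** Region name -> time after which the region may be reassigned. */
    private final Map<String, Long> deadlines = new HashMap<>();

    public AssignmentTimeoutSketch(long timeoutMillis) {
        this.timeoutMillis = timeoutMillis;
    }

    /** Master assigns the region: start the reassignment clock. */
    public void assign(String region, long now) {
        deadlines.put(region, now + timeoutMillis);
    }

    /** Region server says it is still opening: push the deadline out. */
    public void onProcessOpen(String region, long now) {
        if (deadlines.containsKey(region)) {
            deadlines.put(region, now + timeoutMillis);
        }
    }

    /** Region server says the region is open: stop tracking it. */
    public void onRegionOpen(String region) {
        deadlines.remove(region);
    }

    /** True only if the region is still pending and its deadline passed. */
    public boolean shouldReassign(String region, long now) {
        Long deadline = deadlines.get(region);
        return deadline != null && now > deadline;
    }

    public static void main(String[] args) {
        AssignmentTimeoutSketch master = new AssignmentTimeoutSketch(1000);
        master.assign("regionA", 0);
        master.onProcessOpen("regionA", 900);   // deadline moves to 1900
        System.out.println(master.shouldReassign("regionA", 1500));
        System.out.println(master.shouldReassign("regionA", 2000));
    }
}
```

Every progress message resets the deadline, so a slow open is reassigned
only once the region server stops reporting for a full timeout period.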

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
