[jira] [Closed] (HBASE-2755) Duplicate assignment of a region after region server recovery

stack (JIRA) Mon, 21 Mar 2011 18:36:45 -0700

     [ 
https://issues.apache.org/jira/browse/HBASE-2755?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


stack closed HBASE-2755.
------------------------


This was applied a while back.

> Duplicate assignment of a region after region server recovery
> -------------------------------------------------------------
>
>                 Key: HBASE-2755
>                 URL: https://issues.apache.org/jira/browse/HBASE-2755
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>            Reporter: Kannan Muthukkaruppan
>            Assignee: Jonathan Gray
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: bs.txt
>
>
> After a region server recovery, some regions may get assigned to duplicate 
> region servers.
> Note: I am based on a slightly older trunk (prior to the HBASE-2694). 
> Nevertheless, I think HBASE-2694 doesn't address this case.
> Scenario:
> * Three region server setup (store285,286,287), with about 500 regions in the 
> table overall.
> * kill -9 and restart one of the region servers (store286).
> * The 170 odd regions in the failed region server got assigned out. But two 
> of the regions got assigned to multiple region servers.
> * Looking at the log entries for one such region, it appears that there is 
> some race condition that happens between the ProcessRegionOpen (a 
> RegionServerOperation) and BaseScanner which causes the BaseScanner to think 
> this region needs to be reassigned.
> Relevant Logs:
> Master detects that the server start message (from the restarted RS) is from 
> a server it already knows about, but startcode is different. So, it triggers 
> server recovery. Alternatively, the recovery will be triggered by ZNODE 
> expiry in some cases depending on which ever event (restart of RS or Znode 
> expiry) happens first. After that it does logs splits etc. for the failed RS; 
> it then also removes the old region server/startcode from the deadservers map.
> {code}
> 2010-06-17 10:26:06,420 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Server start rejected; we already have 10.138.95.182:60020 registered; 
> existingServer=serverName=store286.xyz.com,60020,1276629467680, 
> load=(requests=22, regions=171, usedHeap=6549, maxHeap=11993), 
> newServer=serverName=store286.xyz.com,60020,1276795566511, load=(requests=0, 
> regions=0, usedHeap=0, maxHeap=0)
> 2010-06-17 10:26:06,420 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Triggering server recovery; existingServer looks stale
> 2010-06-17 10:26:06,420 DEBUG org.apache.hadoop.hbase.master.ServerManager: 
> Added=store286.xyz.com,60020,1276629467680 to dead servers, added shutdown 
> processing operation
> ... split log processing...
> 2010-06-17 10:29:51,317 DEBUG 
> org.apache.hadoop.hbase.master.RegionServerOperation: Removed 
> store286.xyz.com,60020,1276629467680 from deadservers Map
> {code}
> What follows is the relevant log snippet for one of the regions that gets 
> double assigned.
> Master tries to assign the region to store285. 
> At 10:30:20,006, in ProcessRegionOpen, we update META with information about 
> the new assignment. However, just around the same time, BaseScanner processes 
> this entry (at 10:30:20,009), but finds that the region is still assigned to 
> the old region server. There have been some fixes for double assignment in 
> BaseScanner because BaseScanner might be doing a stale read depending on when 
> it started. But looks like there is still another hole left.
> {code}
> 2010-06-17 10:30:10,186 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. to 
> store285.xyz.com,60020,1276629468460
> 2010-06-17 10:30:11,701 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_PROCESS_OPEN: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. from 
> store285.xyz.com,60020,1276629468460; 8 of
> 2010-06-17 10:30:12,800 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_PROCESS_OPEN: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. from 
> store285.xyz.com,60020,1276629468460; 7 of
> 2010-06-17 10:30:13,905 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_PROCESS_OPEN: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. from 
> store285.xyz.com,60020,1276629468460; 6 of
> ...
> 2010-06-17 10:30:20,001 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. from 
> store285.xyz.com,60020,1276629468460; 1 of 3
> 2010-06-17 10:30:20,001 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. open on 
> store285.xyz.com,60020,1276629468460
> 2010-06-17 10:30:20,006 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. in region 
> .META.,,1 with startcode=1276629468460, server=store285.xyz.com:60020
> 2010-06-17 10:30:20,009 DEBUG org.apache.hadoop.hbase.master.BaseScanner: 
> Current assignment of 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. is not 
> valid;  serverAddress=store286.xyz.com:60020, startCode=1276629467680 unknown.
> {code}
> At this point BaseScanner calls 
> "this.master.getRegionManager().setUnassigned(info, true)" to set the region 
> to be unassigned (even though it is assigned to store285). And later, this 
> region is given to another region server (store287).
> {code}
> 2010-06-17 10:30:20,581 INFO org.apache.hadoop.hbase.master.RegionManager: 
> Assigning region 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. to 
> store287.xyz.com,60020,1276629468678
> 2010-06-17 10:30:25,525 INFO org.apache.hadoop.hbase.master.ServerManager: 
> Processing MSG_REPORT_OPEN: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. from 
> store287.xyz.com,60020,1276629468678; 6 of 6
> 2010-06-17 10:30:25,531 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. open on 
> store287.xyz.com,60020,1276629468678
> 2010-06-17 10:30:25,534 INFO 
> org.apache.hadoop.hbase.master.RegionServerOperation: Updated row 
> test1,9301760000,1276657884012.acff724037d739bab9af61e3edef0cc9. in region 
> .META.,,1 with startcode=1276629468678, server=store287.xyz.com:60020
> {code}

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Closed] (HBASE-2755) Duplicate assignment of a region after region server recovery

Reply via email to