[
https://issues.apache.org/jira/browse/HBASE-17264?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Allan Yang updated HBASE-17264:
-------------------------------
Status: Patch Available (was: Open)
> Process RIT with offline state will always fail to open in the first time
> -------------------------------------------------------------------------
>
> Key: HBASE-17264
> URL: https://issues.apache.org/jira/browse/HBASE-17264
> Project: HBase
> Issue Type: Bug
> Components: Region Assignment
> Affects Versions: 1.1.7
> Reporter: Allan Yang
> Assignee: Allan Yang
> Attachments: HBASE-17264-branch1.1.patch
>
>
> In AssignmentManager#processRegionsInTransition, when handling a region in the
> M_ZK_REGION_OFFLINE state, we use a handler to reassign the region. But when
> calling assign, we pass setOfflineInZK=false, so the zk node is not reset:
> {code}
> case M_ZK_REGION_OFFLINE:
>   // Insert in RIT and resend to the regionserver
>   regionStates.updateRegionState(rt, State.PENDING_OPEN);
>   final RegionState rsOffline = regionStates.getRegionState(regionInfo);
>   this.executorService.submit(
>     new EventHandler(server, EventType.M_MASTER_RECOVERY) {
>       @Override
>       public void process() throws IOException {
>         ReentrantLock lock =
>           locker.acquireLock(regionInfo.getEncodedName());
>         try {
>           RegionPlan plan = new RegionPlan(regionInfo, null, sn);
>           addPlan(encodedName, plan);
>           // We decide not to setOfflineInZK
>           assign(rsOffline, false, false);
>         } finally {
>           lock.unlock();
>         }
>       }
>     });
>   break;
> {code}
> But when setOfflineInZK is false, we pass a zk node version of -1 to the
> regionserver, meaning the zk node does not exist. In fact, the offline zk
> node does exist, with a different version. The RegionServer will report a
> failure to open because of this.
> This actually happened in our test environment. The master will receive the
> FAILED_OPEN zk event and retry later, but due to another bug (I will open
> another jira later), the region will remain in the closed state forever.
> The master assigns the region in RIT:
> {noformat}
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager]
> master.AssignmentManager: Processing 57513956a7b671f4e8da1598c2e2970e in
> state: M_ZK_REGION_OFFLINE
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager]
> master.RegionStates: Transition {57513956a7b671f4e8da1598c2e2970e
> state=OFFLINE, ts=1479892306738, server=example.org,30003,1475893095003} to
> {57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN, ts=1479892306842,
> server=example.org,30003,1479780976834}
> 2016-11-23 17:11:46,842 INFO [example.org:30001.activeMasterManager]
> master.AssignmentManager: Processed region 57513956a7b671f4e8da1598c2e2970e
> in state M_ZK_REGION_OFFLINE, on server: example.org,30003,1479780976834
> 2016-11-23 17:11:46,843 INFO [MASTER_SERVER_OPERATIONS-example.org:30001-0]
> master.AssignmentManager: Assigning
> test,QFO7M,1475986053104.57513956a7b671f4e8da1598c2e2970e. to
> example.org,30003,1479780976834
> {noformat}
> The RegionServer received the open region request and created an
> OpenRegionHandler to open the region, only to find that the RIT node's version
> is not what it expected. The RS transitions the RIT ZK node to failed open in
> the end:
> {noformat}
> 2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1]
> coordination.ZkOpenRegionCoordination: Failed transition from OFFLINE to
> OPENING for region=57513956a7b671f4e8da1598c2e2970e
> 2016-11-23 17:11:46,861 WARN [RS_OPEN_REGION-example.org:30003-1]
> handler.OpenRegionHandler: Region was hijacked? Opening cancelled for
> encodedName=57513956a7b671f4e8da1598c2e2970e
> 2016-11-23 17:11:46,860 WARN [RS_OPEN_REGION-example.org:30003-1]
> zookeeper.ZKAssign: regionserver:30003-0x15810b5f633015f,
> quorum=hbase4dev04.et2sqa:2181,hbase4dev05.et2sqa:2181,hbase4dev06.et2sqa:2181,
> baseZNode=/test-hbase11-func2 Attempt to transition the unassigned node for
> 57513956a7b671f4e8da1598c2e2970e from M_ZK_REGION_OFFLINE to
> RS_ZK_REGION_OPENING failed, the node existed but was version 3 not the
> expected version -1
> {noformat}
> The master received this zk event and began to handle RS_ZK_REGION_FAILED_OPEN:
> {noformat}
> 2016-11-23 17:11:46,944 DEBUG [AM.ZK.Worker-pool2-t1]
> master.AssignmentManager: Handling RS_ZK_REGION_FAILED_OPEN,
> server=example.org,30003,1479780976834,
> region=57513956a7b671f4e8da1598c2e2970e,
> current_state={57513956a7b671f4e8da1598c2e2970e state=PENDING_OPEN,
> ts=1479892306843, server=example.org,30003,1479780976834}
> {noformat}
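
The version mismatch reported in the RegionServer log ("was version 3 not the expected version -1") can be illustrated with a small model. This is only a sketch under stated assumptions: VersionedNodeSketch, its transition method, and the rule "the expected version must equal the node's current version" are hypothetical simplifications for illustration, not HBase's ZKAssign code and not the attached patch.

```java
// Hypothetical, simplified model of a versioned coordination node.
// It is NOT HBase's ZKAssign; the class, method names, and matching rule
// are assumptions made purely to illustrate the failure in the report.
final class VersionedNodeSketch {
    // The version the master handed to the regionserver in the report.
    static final int NO_NODE_VERSION = -1;

    private String state;
    private int version;

    VersionedNodeSketch(String state, int version) {
        this.state = state;
        this.version = version;
    }

    // Moves the node from one state to another only if both the current
    // state and the caller's expected version match.
    // Returns the new version on success, or -1 on failure.
    synchronized int transition(String from, String to, int expectedVersion) {
        if (!state.equals(from) || version != expectedVersion) {
            return -1; // stale or wrong expectation: leave the node untouched
        }
        state = to;
        version++;
        return version;
    }

    synchronized String state() {
        return state;
    }

    public static void main(String[] args) {
        // The offline node already exists at version 3, as in the log above.
        VersionedNodeSketch node = new VersionedNodeSketch("M_ZK_REGION_OFFLINE", 3);

        // The caller believes no node exists (-1), so the transition is refused.
        int stale = node.transition(
            "M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", NO_NODE_VERSION);
        System.out.println("with expected version -1: " + stale);

        // With the node's real version, the same transition goes through.
        int ok = node.transition("M_ZK_REGION_OFFLINE", "RS_ZK_REGION_OPENING", 3);
        System.out.println("with expected version 3: " + ok);
    }
}
```

In this model the first call fails and leaves the node in M_ZK_REGION_OFFLINE, which mirrors the "Failed transition from OFFLINE to OPENING" warning. The second call succeeds because the caller's expectation matches the node's actual version.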